
Add Ops for solve_discrete_lyapunov and solve_continuous_lyapunov #1020

Merged (1 commit, Nov 24, 2022)

Conversation

@jessegrabowski (Contributor)

Following our discussions in #1015 and #1011, this PR adds Ops to aesara.tensor.slinalg that wrap scipy.linalg.solve_discrete_lyapunov and scipy.linalg.solve_continuous_lyapunov and that compute the reverse-mode gradients for each.

One note: I had an error from mypy when running the pre-commit hooks:
aesara\link\c\cmodule.py:2439: error: Unused "type: ignore" comment

But this is unrelated to any changes I made, so I didn't know what to do with it.

@ricardoV94 (Contributor) left a comment

Looks great. I have only a couple of minor questions.

@brandonwillard (Member) left a comment

See the implementation here.

@aesara-devs/aesara, as always, we should take a quick look at the underlying NumPy/SciPy implementations to see whether or not something can be implemented using existing Aesara Ops.

@jessegrabowski (Contributor, Author)

While the direct solver can be written elegantly using only Aesara Ops, I still think there is value in handing the forward computation to scipy. As noted in the scipy documentation, the direct solver scales very poorly with the size of the matrices. Even at N=50, the bilinear solver is two orders of magnitude faster than the direct solver (under 1 ms vs. roughly 180 ms). The memory footprint is also much lower, because a Kronecker product is not required. Here is a graph comparing the performance of the Op from this PR, using the bilinear solver, to the pure-Aesara direct solver.

[figure: runtime of the bilinear-solver Op vs. the pure-Aesara direct solver as N grows]

Sizes larger than N=250 could not be allocated in memory for the direct solution. In the application I have in mind (Kalman filtering), this routine would need to be called once per logp evaluation, so speed is important.
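
For reference, here is a minimal sketch of what the "direct" approach amounts to when written with existing Aesara Ops, assuming real-valued A and Q (the helper name direct_discrete_lyapunov is only for illustration, not code from this PR). The n²-by-n² Kronecker product is exactly what makes the memory footprint blow up:

import aesara.tensor as at
from aesara.tensor.slinalg import kron, solve

def direct_discrete_lyapunov(A, Q):
    # Solve A X A^T - X + Q = 0 for real A, Q via the vec/Kronecker identity
    # vec(A X A^T) = (A ⊗ A) vec(X), i.e. (I - A ⊗ A) vec(X) = vec(Q).
    n = A.shape[0]
    lhs = at.eye(n**2) - kron(A, A)
    x_vec = solve(lhs, Q.T.flatten())  # column-major vec via transpose + flatten
    return x_vec.reshape((n, n)).T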

I realize this is an unfair comparison, and that the real question is whether the bilinear solver could be directly implemented in Aesara. It would first require an implementation of the Schur decomposition (the imaginary part is discarded, so that is not an issue), then an implementation of a Sylvester equation solver, which would in turn require an efficient solution routine like trsyl; otherwise we run into the direct-solver problem again. Alternatively, we could decide that having access to trsyl is worthwhile, wrap the scipy Sylvester equation solver, and call that from a semi-pure aesara implementation of solve_lyapunov, but this strikes me as arbitrary, especially if the Schur decomposition ended up as a scipy wrapper as well.

@brandonwillard (Member) commented Jul 1, 2022

I realize this is an unfair comparison, and that the real question is whether the bilinear solver could be directly implemented in Aesara.

There's no need to justify the inclusion of a relevant SciPy feature, and, yes, the question is whether or not the other estimation approach that is supported by SciPy can be implemented with existing Aesara Ops, like the direct approach.

If need be, a custom Op can always be used for only the other approach, but it's more important that we adequately assess an Aesara-based implementation and determine what's missing, if anything.

@brandonwillard (Member)

For reference, the other discrete approach (i.e. "bilinear") appears to be here, and, from a quick glance, it looks to be well covered by existing Ops.

@jessegrabowski (Contributor, Author)

The relevant lines I see as not being covered are here, inside the solve_continuous_lyapunov function:

    # Compute the Schur decomposition form of a
    r, u = schur(a, output='real')

    # Construct f = u'*q*u
    f = u.conj().T.dot(q.dot(u))

    # Call the Sylvester equation solver
    trsyl = get_lapack_funcs('trsyl', (r, f))

@brandonwillard (Member)

It would first require an implementation of the Schur decomposition (the imaginary part is discarded, so that is not an issue), then an implementation of a Sylvester equation solver, which would in turn require an efficient solution routine like trsyl; otherwise we run into the direct-solver problem again. Alternatively, we could decide that having access to trsyl is worthwhile, wrap the scipy Sylvester equation solver, and call that from a semi-pure aesara implementation of solve_lyapunov, but this strikes me as arbitrary, especially if the Schur decomposition ended up as a scipy wrapper as well.

Yes, a Schur decomposition Op and a trsyl Op are needed, and why would adding them be "arbitrary"?

@jessegrabowski (Contributor, Author)

"Arbitrary" here refers to the criterion for writing an Op wrapper around a given scipy function, but the word was chosen more out of my ignorance of the criterion than any capriciousness (or lack thereof) in the criterion itself. What I am gleaning is that the ideal criteria for a custom scipy Op is, "as close to LAPACK as possible"? One could get more atomic than that, but I doubt it would be productive.

I am also struggling with the merits and demerits of having analytic gradient expressions. My naive perspective is that, even if we had implementations of schur and trsyl, it would still be better to manually implement the lyapunov ops, because we have an analytic expression of the reverse-mode gradients. We relieve the system of the need to compute these, and side-step any potential issues arising from approximation.

@brandonwillard (Member)

"Arbitrary" here refers to the criterion for writing an Op wrapper around a given scipy function, but the word was chosen more out of my ignorance of the criterion than any capriciousness (or lack thereof) in the criterion itself. What I am gleaning is that the ideal criteria for a custom scipy Op is, "as close to LAPACK as possible"? One could get more atomic than that, but I doubt it would be productive.

Ah, yeah, the "depth" at which something should be implemented is always measured relative to its complexity and ancillary benefits. As you've noticed, this library's current "depth" is somewhere around the BLAS/LAPACK level. As a result, the complexity of implementing something at that level is relatively low, and there are multiple examples illustrating how to implement and test such Ops (e.g. see aesara.tensor.blas). These examples make good use of the old C backend, so they also serve as good templates for making performant Ops.

Those new Ops can also be implemented in exactly the same way as the current ones in this PR (e.g. without C, Numba, and/or JAX implementations). We can always return to them later and add other backend implementations, if need be. Likewise, their gradients shouldn't be any more complicated than the current ones.
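
To make that concrete, a perform-only wrapper around trsyl could be as small as the following sketch (the class name TrSylvester and every detail here are illustrative, not part of this PR; the LAPACK scale/info handling is deliberately left as a note):

import numpy as np
from scipy.linalg import get_lapack_funcs

import aesara.tensor as at
from aesara.graph.basic import Apply
from aesara.graph.op import Op


class TrSylvester(Op):
    """Sketch of a perform-only Op that calls LAPACK's trsyl through SciPy."""

    __props__ = ()

    def make_node(self, R, S, F):
        R, S, F = map(at.as_tensor_variable, (R, S, F))
        return Apply(self, [R, S, F], [at.matrix(dtype=F.dtype)])

    def perform(self, node, inputs, outputs):
        R, S, F = inputs
        trsyl = get_lapack_funcs("trsyl", (R, S, F))
        Y, scale, info = trsyl(R, S, F)
        # NOTE: a real implementation must apply the `scale` factor and check
        # `info`, following what scipy.linalg.solve_sylvester does.
        outputs[0][0] = np.asarray(Y, dtype=node.outputs[0].dtype)

A C (or Numba/JAX) implementation can then be layered on later without changing this graph-level interface.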

Work at this "depth" provides two important ancillary benefits that are exclusive to this approach:

  • two new low-level Op implementations that can be used to implement other "compound" functions, and
  • access to optimizations on the intermediate Ops used by the compound implementations.

Our ability to reason about expressions and manipulate them in an automated fashion is what allows many of the performance and stability optimizations that set this project apart from plain NumPy and SciPy, as well as other tensor libraries. Redundant, high-level Ops that combine the functionality of existing Ops effectively hide information from Aesara that could be used to perform optimizations.

Consider an Op that is equivalent to $f(x) = \exp(x^2)$. If a graph is compiled that represents the expression $\log \exp(x^2)$, then Aesara is able to "analytically" reduce that expression to $x^2$ using one simple rewrite. If a computationally equivalent graph representing $\log f(x)$ is compiled, then an unnecessary exponential function evaluation is required.
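
A quick way to see this in practice (a sketch; it assumes the stock canonicalization that removes a log(exp(...)) pair fires when the function is compiled):

import aesara
import aesara.tensor as at

x = at.scalar("x")

# Build log(exp(x ** 2)) from existing Ops and let the rewriter simplify it.
fn = aesara.function([x], at.log(at.exp(x**2)))

# Printing the compiled graph should show that the exp/log pair was removed,
# leaving only the square.
aesara.dprint(fn)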

One could construct new rewrites explicitly for the redundant Op, but those rewrites would be redundant themselves and only add to the development and testing effort. The same goes for transpilation support (e.g. C, Numba, JAX).

In this situation, redundancy begets redundancy, so, unless the benefits of introducing redundancy in a particular instance are very clear, we must avoid it altogether.

I am also struggling with the merits and demerits of having analytic gradient expressions. My naive perspective is that, even if we had implementations of schur and trsyl, it would still be better to manually implement the lyapunov ops, because we have an analytic expression of the reverse-mode gradients. We relieve the system of the need to compute these, and side-step any potential issues arising from approximation.

First, we always have an "analytic" expression for the gradients. Numerical approximations are not produced or used by Aesara.

Second, the Op.grad implementations you've already produced necessarily contain the same gradient "information" that would go into Schur and trsyl Op.grad implementations. More specifically, since trsyl is a solver for the Sylvester equations, you already have its Op.grad implementation in the paper referenced to produce the implementations for the Lyapunov Ops.

The Op.grad implementation for the Schur decomposition is perhaps the only open question.
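
For concreteness, a sketch of the standard reverse-mode identities for the Sylvester equation (written for the real/transpose case; conjugation and sign conventions depend on the exact form trsyl solves): given $AX + XB = C$ and an output adjoint $\bar{X}$, solve the adjoint equation $A^T S + S B^T = \bar{X}$ and set $\bar{C} = S$, $\bar{A} = -S X^T$, and $\bar{B} = -X^T S$.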

@jessegrabowski (Contributor, Author)

Thanks for laying it out clearly. I'm clearly not getting it through my head that aesara straddles the line between sympy and numpy. I appreciate your patience as I learn all this.

I'm happy to work on trsyl. As you note, the reverse-mode equations are already worked out in the same paper I'm already working from. I can submit it as a separate PR. I also have an Op for solve_discrete_are worked out, but this uses scipy.linalg.ordqz, scipy.linalg.qr, and scipy.linalg.lu, which would all need to be implemented to avoid just directly wrapping scipy.linalg.solve_discrete_are. QR decomposition is in tensor.nlinalg but it doesn't appear to have gradients implemented.

I spent this morning doing some research regarding the schur decomposition, and it seems it would be non-trivial to implement. I couldn't find an implementation in TF, Torch, or mxnet. I dug around on google scholar and found implementations for gradients of SVD, QR, and LU decompositions (here and here), but not Schur or QZ.

If anyone has a reference, I'm happy to work on figuring an implementation out. But, as I'm sure you can tell from repeated interactions with me, I'm already extremely out of my depth here. Trying to work out the gradients myself from first principles is not likely to happen.

@brandonwillard (Member)

I can submit it as a separate PR.

If we can't get a gradient for the Schur decomposition in place, then a separate PR for trsyl would be the next best thing.

I also have an Op for solve_discrete_are worked out, but this uses scipy.linalg.ordqz, scipy.linalg.qr, and scipy.linalg.lu, which would all need to be implemented to avoid just directly wrapping scipy.linalg.solve_discrete_are. QR decomposition is in tensor.nlinalg but it doesn't appear to have gradients implemented.

Same with those other decompositions (e.g. QR, LU) with known gradients; having Ops and/or Op.grad implementations for those would be a big help.

In the end, we really only need to do this sort of due diligence to determine exactly why something can/can't reasonably be done at a lower Op level. Knowing specifically that the roadblock is a derivative for the Schur decomposition is incredibly useful, because we can—among other things—create an issue for it and explicitly track its relevance to other Op implementations and, ideally, gather information and make progress on that issue over time.

I spent this morning doing some research regarding the schur decomposition, and it seems it would be non-trivial to implement. I couldn't find an implementation in TF, Torch, or mxnet. I dug around on google scholar and found implementations for gradients of SVD, QR, and LU decompositions (here and here), but not Schur or QZ.

If anyone has a reference, I'm happy to work on figuring an implementation out. But, as I'm sure you can tell from repeated interactions with me, I'm already extremely out of my depth here. Trying to work out the gradients myself from first principles is not likely to happen.

It wouldn't surprise me if no one has published a solution to this one, but my first thought is that the approach would be very similar to the one for QR, since the two decompositions are related. If I recall correctly, a common QR algorithm essentially computes the Schur decomposition, but there might be some subtleties involving real/imaginary results.

@twiecki (Contributor) commented Jul 3, 2022

@brandonwillard But if we can't easily get Schur, we can't get lyapunov that way, no? If that's the case it sounds like this PR would be our next best bet.

@brandonwillard (Member) commented Jul 3, 2022

@brandonwillard But if we can't easily get Schur, we can't get lyapunov that way, no? If that's the case it sounds like this PR would be our next best bet.

Yes, that's part of what I was saying. There are other options we could consider, though. For instance, it might be straightforward to identify Schur + trsyl combinations and provide gradients for those—or something similar.

@brandonwillard changed the title to "Add Ops for solve_discrete_lyapunov and solve_continuous_lyapunov" on Jul 3, 2022
@brandonwillard added the "tensor algebra" label (Relates to our use and representations of tensor algebra) on Jul 3, 2022
@codecov (bot) commented Jul 3, 2022

Codecov Report

Attention: Patch coverage is 81.15942% with 13 lines in your changes missing coverage. Please review.

Project coverage is 74.16%. Comparing base (14c394d) to head (48965b2).
Report is 246 commits behind head on main.

Files with missing lines | Patch % | Lines
aesara/tensor/slinalg.py | 81.15% | 11 Missing and 2 partials ⚠️

@@            Coverage Diff             @@
##             main    #1020      +/-   ##
==========================================
+ Coverage   74.10%   74.16%   +0.06%     
==========================================
  Files         174      174              
  Lines       48673    48774     +101     
  Branches    10373    10376       +3     
==========================================
+ Hits        36067    36175     +108     
+ Misses      10315    10311       -4     
+ Partials     2291     2288       -3     
Files with missing lines | Coverage Δ
aesara/tensor/slinalg.py | 83.79% <81.15%> (-0.61%) ⬇️

... and 4 files with indirect coverage changes

@aseyboldt (Contributor) commented Jul 4, 2022

I hope I don't seem like a broken record when saying this, but implementing the backward-mode gradient operations through, e.g., the Schur decomposition isn't ideal.

Let's say we want to compute Lyapunov(A=2 * eye, Q=-eye), so we have $AXA^H - X + Q = 0$ or $X = \frac{1}{3}I$.

For a given $\bar{X} $ the values for $\bar{Q} $ and $\bar{A} $ are also nicely defined, and the formulas from the paper work just fine.

But if we compute the gradients through the schur decomposition we run into trouble:

First we can notice that because $A $ is symmetric, the schur decomposition coincides with the eigenvalue decomposition. But because the eigenvalues of $A $ are not unique, the function that maps $A $ to its eigenvalue decomposition isn't a well defined function in the mathematical sense, because any orthogonal matrix $Q $ can be used as eigenvector matrix. In the forward code this isn't an issue, because we don't care which $Q $ we get as long as it is orthogonal. However, the backward operation of the eigenvalue decomposition isn't well defined because of this, which shows up in the formulas for $\bar{A} $ as a division by zero:

[image: the reverse-mode formula for $\bar{A}$, which divides by differences of eigenvalues]

This makes me think that the Op as it is implemented in this PR is actually better than a rewrite that uses the Schur decomposition and its gradient directly. It is reasonably stable, tested, and produces values and gradients whenever they are well defined (i.e. $\lambda_i(A) \lambda_j(A) \neq 1$), and this is probably also true for most other implicitly defined functions of this kind (e.g. Sylvester, Riccati, etc.). I for one would really like to have this functionality in aesara.

But if we want an even better solution, maybe we could get the best of both worlds: if we have forward Ops for schur, trsyl, qz, etc., we could write the Lyapunov et al. Ops as aesara.compile.builders.OpFromGraph or something similar. This way we could tell Aesara more about what's happening in the solver code and also optimize those expressions (for instance, the current PR would compute the Schur decomposition of $A$ twice, once in the forward code and once in the gradient, and Aesara should be smart enough to notice this and remove the redundant work from the graph during common subexpression elimination). We could still provide custom code for the gradients, though. I'd be disappointed, however, if this were a strict requirement and, because nobody wants to do the work (and I think it is quite a bit of work), we ended up not getting this functionality at all.

@brandonwillard (Member) commented Jul 5, 2022

I hope I don't seem like a broken record when saying this, but implementing the backward-mode gradient operations through, e.g., the Schur decomposition isn't ideal.

I've yet to see how "expanding" the Lyapunov-solving function/operator into its constituent parts wouldn't be ideal. In other words, if it could be done, why wouldn't it be better than the un-expanded version?

So far, in this discussion, the only concrete issue I've noticed is whether or not one component of the expanded version could be implemented without more effort than it's theoretically worth, but that's distinct from the question(s) "Is it possible or ideal?".

Aside from that, I've already given a directly relevant example demonstrating how such expansions can be at least as performant as their un-expanded counterparts (i.e. the discrete "direct" case). Just so we're clear, the take-away from that example is that we should dispatch based on the method, so that the expanded version is used when possible, and, if need be, we can have an un-expanded Op for just the continuous case.

Let's say we want to compute Lyapunov(A=2 * eye, Q=-eye), so we have $AXA^H - X + Q = 0$ or $X = \frac{1}{3}I$.
For a given $\bar{X}$ the values for $\bar{Q}$ and $\bar{A}$ are also nicely defined, and the formulas from the paper work just fine.
But if we compute the gradients through the schur decomposition we run into trouble:

It sounds like you're restating one of the potential challenges behind implementing a complete gradient for the constituent Schur step, and not necessarily making a statement about the performance or quality of an expanded approach, both of which are important aspects of an ideal/not-ideal approach.

Before going further, it should be clear that we're talking about one very specific "degenerate" set of inputs for which we have not even performed any sort of analysis for the un-expanded case (i.e. the current implementation), let alone a relevant comparative analysis with an expanded implementation of any form.

That said, this is really a constrained issue, and one that might be surmountable in any number of ways, some of which could have everything/nothing to do with the specifics of an Op.grad implementation or rewriting.

This is especially relevant given that the provided example equates eigen-decompositions with Schur decompositions, which may be true in some cases at a high level, but not in terms of the solution-generating processes and their choices regarding mathematically ambiguous mappings, subgradients, removable singularities, etc.

If we want consistent gradient implementations, then we need to be consistent with these solution-generating processes and the definitions they use—not just the high-level mathematical definitions.

First we can notice that because $A$ is symmetric, the schur decomposition coincides with the eigenvalue decomposition. But because the eigenvalues of $A$ are not unique, the function that maps $A$ to its eigenvalue decomposition isn't a well defined function in the mathematical sense, because any orthogonal matrix $Q$ can be used as eigenvector matrix.

The actual function mapping $A$ to its eigen-decomposition (i.e. numpy.linalg.eigh) appears to be well defined. You demonstrated this by showing that it chooses the Euclidean basis vectors by returning an identity matrix. Sure, the very broadly defined mathematical definition of an eigen-decomposition doesn't specify as much, but that's because it doesn't particularly serve the mathematical definition to do so. Nevertheless, the actual mappings we use are well-defined enough to specify valid eigenvectors, so that point is moot.

More importantly, consider a simpler function with a similar issue: the absolute value.

import numpy as np
import aesara
import aesara.tensor as at


x = at.scalar("x")
y = at.abs(x)

y_fn = aesara.function([x], [y, aesara.grad(y, x)])

y_fn(0.0)
# [array(0.), array(0.)]

As with the absolute value, it's possible to choose a specific subgradient value. I don't see why the same isn't possible/reasonable here.

In the forward code this isn't an issue, because we don't care which $Q$ we get as long as it is orthogonal. However, the backward operation of the eigenvalue decomposition isn't well defined because of this, which shows up in the formulas for $\bar{A}$ as a division by zero:

I keep noticing mention of "forward" and "backward", but I don't yet see why these distinctions are relevant, so I can't address those aspects at the moment.

Regardless, let's see if we can quickly do something along the lines of the subgradient approach mentioned above, but for this exact example.

Currently, our Eigh.grad is an awkward implementation that is driven by an EighGrad Op, making it an un-expanded implementation (i.e. its gradient doesn't generate an Aesara graph comprised of other, "lower-level" Ops).

For comparison, we'll simply add the implementation from Giles, Mike. 2008. “An Extended Collection of Matrix Derivative Results for Forward and Reverse Mode Automatic Differentiation.” to a custom Eig class. (Since Eig, and most other Ops in aesara.tensor.linalg, don't have Op.grad implementations, this can also serve as a test/example of some sorely needed additions.)

import numpy as np
import aesara
import aesara.tensor as at
from aesara.tensor.nlinalg import Eig, _zero_disconnected


A = at.matrix("A")

d, U = at.linalg.eigh(A)
eigh_fn = aesara.function([A], [d, U, at.grad(U.sum(), A)])


class MyEig(Eig):
    def L_op(
        self,
        inputs,
        outputs,
        g_outputs,
    ):
        (A,) = inputs
        d, U = outputs
        dd, dU = _zero_disconnected([d, U], g_outputs)
        dD = at.diag(dd)

        # Compute all differences of the elements in `d`, i.e. `d[j] - d[i]`
        E = at.outer(at.ones(d.shape[0]), d) - d[..., None]

        # This is what the current version of `Eigh.grad` effectively does:
        # from aesara.tensor.subtensor import set_subtensor
        # non_diag_mask = tm.invert(at.eye(E.shape[0], dtype="bool"))
        # F = at.zeros_like(E)
        # F = set_subtensor(F[non_diag_mask], tm.reciprocal(E[non_diag_mask]))

        # This replaces all `d[j] == d[i]` with 0, instead of just the diagonal
        # of E
        F = at.switch(at.neq(E, 0.0), at.reciprocal(E), 0.0)

        # TODO: This inverse probably isn't good.
        dA = at.linalg.inv(U).T @ (dD + F * U.T @ dU) @ U.T
        return [dA]


myeig = MyEig()

d_2, U_2 = myeig(A)
myeig_fn = aesara.function([A], [d_2, U_2, at.grad(U_2.sum(), A)])


rng = np.random.default_rng(2039)
A_val = rng.normal(size=(3, 3))
A_val = A_val.T @ A_val
A_val = A_val + A_val.T

d_val, U_val, dU_sum_val = eigh_fn(A_val)
d_val, U_val, dU_sum_val
# [array([ 1.60014544,  6.97086699, 10.3810344 ]),
#  array([[ 0.60544104, -0.06363686,  0.79334198],
#         [-0.65820594, -0.6004278 ,  0.4541491 ],
#         [-0.44744396,  0.7971429 ,  0.40540979]]),
#  array([[-0.0907336 ,  0.        ,  0.        ],
#         [ 0.32372021,  0.14822482,  0.        ],
#         [-0.30375936,  0.0925893 , -0.05749122]])]

# Rough check of the eigen-decomposition
np.allclose(U_val * d_val @ np.linalg.inv(U_val), A_val)
# True

%timeit eigh_fn(A_val)
# 148 µs ± 4.18 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

d_2_val, U_2_val, dU_sum_2_val = myeig_fn(A_val)
d_2_val, U_2_val, dU_sum_2_val
# [array([ 1.60014544, 10.3810344 ,  6.97086699]),
#  array([[-0.60544104,  0.79334198, -0.06363686],
#         [ 0.65820594,  0.4541491 , -0.6004278 ],
#         [ 0.44744396,  0.40540979,  0.7971429 ]]),
#  array([[-0.03123565, -0.12868062, -0.41475149],
#         [ 0.01339011,  0.05516287,  0.17779589],
#         [-0.01800771, -0.07418586, -0.23910902]])]

np.allclose(U_2_val * d_2_val @ np.linalg.inv(U_2_val), A_val)
# True

%timeit myeig_fn(A_val)
# 97.6 µs ± 1.3 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# Perform some rough checks to make sure that the new gradient implementation
# for our new `Op` is working:
aesara.gradient.verify_grad(lambda A: myeig(A)[0].sum(), [A_val], rng=rng)

# How about at a (previously) "undefined" input?
aesara.gradient.verify_grad(lambda A: myeig(A)[0].sum(), [np.eye(3)], rng=rng)

# What about `Eigh`?
aesara.gradient.verify_grad(lambda A: at.linalg.eigh(A)[0].sum(), [A_val], rng=rng)

aesara.gradient.verify_grad(lambda A: at.linalg.eigh(A)[0].sum(), [np.eye(3)], rng=rng)
# ValueError: ('abs_err not finite', 'array([[nan,  0.,  0.],\n       [nan, nan,  0.],\n       [nan, nan, nan]])')

# Now, try the degenerate case:
eigh_fn(np.eye(3))
# [array([1., 1., 1.]),
#  array([[1., 0., 0.],
#         [0., 1., 0.],
#         [0., 0., 1.]]),
#  array([[nan,  0.,  0.],
#         [nan, nan,  0.],
#         [nan, nan, nan]])]

myeig_fn(np.eye(3))
# [array([1., 1., 1.]),
#  array([[1., 0., 0.],
#         [0., 1., 0.],
#         [0., 0., 1.]]),
#  array([[0., 0., 0.],
#         [0., 0., 0.],
#         [0., 0., 0.]])]

Like the expanded discrete Lyapunov example, this expanded gradient implementation is at least as performant as a comparable non-expanded implementation.

While the comparison isn't exact, because Eigh makes assumptions that Eig doesn't and produces different eigenvectors, it still sufficiently illustrates the points.

This makes me think that the Op as it is implemented in this PR is actually better than a rewrite that uses the Schur decomposition and its gradient directly. It is reasonably stable, tested, and produces values and gradients whenever they are well defined (i.e. $\lambda_i(A) \lambda_j(A) \neq 1$), and this is probably also true for most other implicitly defined functions of this kind (e.g. Sylvester, Riccati, etc.). I for one would really like to have this functionality in aesara.

To reiterate, you've only stated one of the implementation challenges/choices, not a point regarding the relative quality of different implementations; otherwise, what you're saying appears to amount to "We have an implementation in this branch and I want to use it, so it's better than the unimplemented approaches we're discussing". It's fine to say you want this (or any) implementation, but it doesn't make sense to use that as a means of comparing their relevant qualities (aside from being present or not).

Also, functions are only as well defined as they're implemented/specified: i.e. the fact that a common mathematical definition of a function isn't well defined at certain points doesn't mean that a viable, refined definition of it isn't possible; however, you seem to be implying the latter.

@brandonwillard (Member) commented Jul 6, 2022

@aseyboldt neatly summarized the possible approaches and their pros/cons in a private conversation, and one of the points he made was that what we're calling the "subgradient approach" (i.e. making "usable" Op.grad choices for degenerate values) is not likely to work as generally or easily as we'd want.

I absolutely agree that the OpFromGraph approach is probably our best bet for retaining the potential performance and implementation benefits of this PR's unexpanded approach and an Op-expanded approach.

@aseyboldt (Contributor)

My summary from that chat (slightly edited, I hope I didn't break anything @brandonwillard ):

We know that if we split the solve op and then chain the backward ops of the computation graph of solve_lyapunov (ie schur decomposition and trsyl) we get additional singularities that we don't want: The lyapunov equation always has a solution and gradient, unless $\lambda_i(A) \lambda_j(A) = 1$ for some $i, j$.
But using the derivatives of the schur ops etc would lead to singularities if $\lambda_i(A) = \lambda_j(A)$ for some $i, j$, and since A could have several zero eigenvalues in some applications this wouldn't be great.

We know of these 5 options:

  • Just ignore the problem and hope nobody cares for those A matrices
  • Use a single custom Op with custom gradient impl to directly use the manually computed backward op (this PR)
  • Split the lyapunov op using OpFromGraph and overwrite the gradient there. (With or without the gradient implementations for schur and trsyl, we wouldn't need those for this solution, even if of course they'd still be great to have).
  • Write a graph rewrite that transforms ops that look like they have those singularities
  • Try to figure out definitions for the singularities that make things work anyway

And my current personal feelings about those would be something like this:

  • About 1: I don't like ignoring the problem
  • About 2: This clearly isn't perfect but I'd be fine with it, at least as long as nobody actually does the work for one of the others. We unfortunately end up doing the schur decomposition of A twice as well, so this should be significantly slower in some cases.
  • About 3: I think this is my favorite, because it is pretty close to the math, more or less straightforward to implement, and relatively easy to verify. But of course it is specific to this function and doesn't generalize.
  • About 4: I'd like this as well, but graph rewrites generally scare me a bit, especially if they are complicated like this one might be, but I'd like to be pleasantly surprised by a nice rewrite that does this. I think I prefer to just generate a good graph in the first place where possible.
  • About 5: I wouldn't really know where to start, and it sounds scary to me, but if it works I think this would also be nice.

This turned out to be trickier than I thought; I hope we didn't scare you away @jessegrabowski :-)

@jessegrabowski (Contributor, Author) commented Jul 7, 2022

Not scared away, just staying quiet to avoid making myself look like a fool (a common occurrence).

Some questions:

  1. It seems the consensus is using an OpFromGraph then "overwriting the gradients"? I'm familiar with OpFromGraph but not with the ability to overwrite gradients. Is there another function somewhere that does this I could consult as a reference?

  2. Where I was planning to go was to follow @brandonwillard 's direct solver implementation, and dispatch either to that or to the continuous solver based on the size of the A matrix. This would be an intermediate step, until a solution for the schur step is found. Is this consistent with (1)?

  3. It also seems it might be necessary to implement some LAPACK functions directly (inferring from C implementations for LAPACK functions #1030). trsyl keeps coming up, but so does gees implicitly (the schur solver). Are there some examples of COps I can look at for reference here? I will post some more questions directly in that issue, since I imagine these will end up overlapping.

Plus a comment:

  1. On the connection between Schur and QR, I reached out to some experts in an effort to get a better understanding of the link here. Evidently QR can be applied blockwise as a solution algorithm for Schur, but there is no "simple" one-to-one computational correspondence between the two. For example, gees doesn't directly call one of the qr solvers (geqrf, geqp3, etc.), which I was really hoping it would. I am going to keep reaching out to researchers and I'll let you know if I come up with anything.

@ricardoV94 (Contributor)

  1. It seems the consensus is using an OpFromGraph then "overwriting the gradients"? I'm familiar with OpFromGraph but not with the ability to overwrite gradients. Is there another function somewhere that does this I could consult as a reference?

You can define a function for the gradient of an OpFromGraph instead of relying on the autodiff of the internal graph. See example 3 here: https://aesara.readthedocs.io/en/latest/library/compile/opfromgraph.html?highlight=OpFromGraph#opfromgraph
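
For what it's worth, a minimal sketch of that pattern (the quadratic "forward graph" and the gradient function are stand-ins to show the mechanics; this assumes the grad_overrides keyword described on that documentation page):

import aesara
import aesara.tensor as at
from aesara.compile.builders import OpFromGraph

x = at.vector("x")
forward = x**2  # stand-in for the expanded Lyapunov forward graph


def custom_grad(inputs, output_grads):
    # Return the hand-derived input gradients instead of letting Aesara
    # differentiate through the inner graph.
    (x_inner,) = inputs
    (g_out,) = output_grads
    return [2 * x_inner * g_out]


square_op = OpFromGraph([x], [forward], grad_overrides=custom_grad)

y = square_op(x)
grad_fn = aesara.function([x], aesara.grad(y.sum(), x))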

@brandonwillard (Member)

2. This would be an intermediate step, until a solution for the schur step is found. Is this consistent with (1)?

Yes, I believe so.

3. Are there some examples of COps I can look at for reference here?

aesara.tensor.blas contains a lot of the COps that use BLAS at the C level; however, most/all of them are subclasses of the same GemmRelated class, so they're not particularly good design examples.

aesara.scalar.math has a wide array of simple COp.c_code implementations that are probably worth looking at. As long as there's a BLAS/LAPACK-containing library that can be found by the linker, something not far from those simple implementations can be used—along with some COp.[c_libraries, c_compile_args, c_lib_dirs, c_header_dirs] overrides that make the compiler aware of the required external libraries.

Also, if NumPy/SciPy expose their own C-level means of accessing BLAS/LAPACK functions, then one can always use those.

@brandonwillard brandonwillard added Op implementation Involves the implementation of an Op and removed new Op labels Jul 11, 2022
@jessegrabowski (Contributor, Author)

I updated this PR with the following changes:

  1. The direct solver for the discrete lyapunov case is now written in aesara, based on the implementation by @brandonwillard presented above.
  2. A check for complex objects has been added to avoid using the .conj() method when it is not needed. This should provide gradients in the case that all matrices are real, which is the most common use case I think.
  3. None is no longer an accepted value for the method parameter in solve_discrete_lyapunov. The default is direct, which calls the pure aesara implementation.
  4. The docstring on solve_discrete_lyapunov has been updated to explain that direct should be preferred in all cases, except when N is large, or when inputs are complex and gradients are required.
  5. Two additional test cases were added, covering real and complex inputs to the new _solve_discrete_lyapunov_direct function.
  6. The discrete lyapunov Op with gradients implemented "by hand" has been renamed to SolveDiscreteLyapunovBilinear, and no longer takes a method parameter on initialization.

My idea with this setup is to direct users to the Aesara implementation, while leaving the "by hand" version in place until we can cover all the corner cases: complex gradients, and a native Aesara implementation of the continuous case. If the consensus is to remove it altogether, though, I can do that.
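
For clarity, usage with these changes would look roughly like this (a sketch; the exact signatures are whatever ends up in the final diff):

import aesara.tensor as at
from aesara.tensor.slinalg import solve_discrete_lyapunov

A = at.matrix("A")
Q = at.matrix("Q")

# Pure-Aesara direct solver (the new default); differentiable via existing Ops.
X_direct = solve_discrete_lyapunov(A, Q, method="direct")

# SciPy-backed bilinear solver with hand-written gradients, for large N.
X_bilinear = solve_discrete_lyapunov(A, Q, method="bilinear")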

@brandonwillard (Member) left a comment

These changes look great. We just need to resolve the conflict(s) and we can get to merging this.

@jessegrabowski (Contributor, Author) commented Sep 29, 2022

I think I messed this all up trying to update the branch to match the current aesara main and resolve the conflicts; maybe I should delete it and try again?

@brandonwillard (Member)

I think I messed this all up trying to update the branch to match the current aesara main and resolve the conflicts; maybe I should delete it and try again?

One minute; I'll take a (local) look.

@brandonwillard (Member) left a comment

I just pushed some changes to this branch. @jessegrabowski, take a look and make sure all your changes are in there.

If you want to make more changes, you'll need to fetch the upstream changes and do something like git rebase --skip <origin-name>/solve_lyapunov to pull in these remote changes and overwrite your local ones.

Before you do that, make sure your <origin-name> (e.g. origin) remote is set to pull from your fork (i.e. jessegrabowski/aesara.git) where these changes reside. You can use git remote to do that. (N.B. There are a lot of ways to do this, but the important point is that you want to overwrite your local changes with whatever is here in jessegrabowski:solve_lyapunov.)

You should probably check your upstream remote, too, and just make sure it points to aesara-devs/aesara.git.

@jessegrabowski (Contributor, Author)

Thanks for taking the time to clean up my mess. I'll take some time to get my git situation squared away before I move forward with any other contributions.

@brandonwillard previously approved these changes Sep 30, 2022
@andrejmuhic commented Oct 6, 2022

If it helps anyone, I am also posting here; for reference, I have a background in applied linear algebra.
I just wanted to add that the Schur form is numerically computed using the QR step algorithm (I prefer this name over "QR algorithm"), which is somewhat related to power iteration, in its subspace flavour. This is done efficiently with shifts and implicit bulge chasing, which is usually out of scope for an implementation from scratch; at least it does not make sense to do so when the LAPACK version is available, and it is hard to do better than that. An additional complication arises if one wants the real Schur form, the upper quasi-triangular matrix with 1-by-1 and 2-by-2 blocks, where the 2-by-2 blocks correspond to complex-conjugate eigenvalue pairs.
C++ code that does this directly from the decompositions and their properties, in a projective-gradient-like way, is available for example in:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/FunctionsManual.cpp
The other option would be to backprop through the steps of the algorithm directly, but that requires a numerically stable reimplementation with autodiff support, and also ensuring that the gradient itself is numerically stable, which sounds hard.
I think the Schur decomposition is not there, but following the documented code and the resources in the comments, it should be possible to produce a formula for the eigenvalue gradients if one really desired the more stable formula.

@rlouf (Member) commented Nov 24, 2022

Sorry for the delay in reviewing this. It looks like Brandon approved this PR, so unless the tests fail after rebasing on main I will be merging this. Thank you for your contribution, and your patience!

Labels
Op implementation (Involves the implementation of an Op), SciPy compatibility, tensor algebra (Relates to our use and representations of tensor algebra)
7 participants