
Fuse Elemwise graphs that have multiple outputs and clients #121

Merged

ricardoV94 merged 13 commits into pymc-devs:main from multiple_output_fusion on Feb 8, 2023

Conversation


@ricardoV94 ricardoV94 commented Dec 13, 2022

Continuation of aesara-devs/aesara#1242, as I've been blocked on that repo.

More context can be found in aesara-devs/aesara#1237

To clarify the idea behind allowing Composites with multiple clients, consider the following example:

import pytensor as aesara
import pytensor.tensor as at

x = at.vector("x")
y = at.exp(x/5)
w = y + 1
z = y * 2

f1 = aesara.function([x], [z, w])
aesara.dprint(f1)

Before this PR it produced this graph:

Elemwise{Mul}[(0, 1)] [id A] 2
 |TensorConstant{(1,) of 2.0} [id B]
 |Elemwise{Composite{exp((i0 * i1))}} [id C] 0
   |TensorConstant{(1,) of 0.2} [id D]
   |x [id E]
Elemwise{add,no_inplace} [id F] 1
 |TensorConstant{(1,) of 1.0} [id G]
 |Elemwise{Composite{exp((i0 * i1))}} [id C] 0

After:

Elemwise{Composite{(i2 * exp((i0 * i1))), (i3 + exp((i0 * i1)))}}.0 [id A] 0
 |TensorConstant{(1,) of 0.2} [id B]
 |x [id C]
 |TensorConstant{(1,) of 2.0} [id D]
 |TensorConstant{(1,) of 1.0} [id E]
Elemwise{Composite{(i2 * exp((i0 * i1))), (i3 + exp((i0 * i1)))}}.1 [id A] 0

The expected savings come from iterating over the data vector x only once, versus three times before: once over x and twice over exp(i0 * i1).
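To make the savings concrete, here is a rough Python sketch of the loop structure before and after fusion (illustration only, not PyTensor's actual generated code):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=10)

# Before: three separate element-wise passes over the data.
y = np.empty_like(x)
for i in range(x.size):  # loop 1: y = exp(x / 5)
    y[i] = np.exp(0.2 * x[i])
w = np.empty_like(x)
for i in range(x.size):  # loop 2: w = y + 1
    w[i] = y[i] + 1.0
z = np.empty_like(x)
for i in range(x.size):  # loop 3: z = y * 2
    z[i] = y[i] * 2.0

# After: one fused pass computes both outputs per element.
w2, z2 = np.empty_like(x), np.empty_like(x)
for i in range(x.size):
    tmp = np.exp(0.2 * x[i])  # shared sub-expression, computed once
    z2[i] = 2.0 * tmp
    w2[i] = tmp + 1.0

assert np.allclose(w, w2) and np.allclose(z, z2)
```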

Another example is the following graph:

x = at.dvector("x")
mu = at.dvector("mu")
logp = -((x - mu) ** 2) / 2
grad = at.grad(logp.sum(), x)
f2 = aesara.function([mu, x], [logp, grad])
aesara.dprint(f2)

Before:

Elemwise{Composite{(i0 * sqr(i1))}}[(0, 1)] [id A] 2
 |TensorConstant{(1,) of -0.5} [id B]
 |Elemwise{sub,no_inplace} [id C] 0
   |x [id D]
   |mu [id E]
Elemwise{neg,no_inplace} [id F] 1
 |Elemwise{sub,no_inplace} [id C] 0

After:

Elemwise{Composite{(-(i0 - i1)), (i2 * sqr((i0 - i1)))}}.1 [id A] 0
 |x [id B]
 |mu [id C]
 |TensorConstant{(1,) of -0.5} [id D]
Elemwise{Composite{(-(i0 - i1)), (i2 * sqr((i0 - i1)))}}.0 [id A] 0

Again, we replace three loops with one.

For more detail, here is the C code of the Composite scalars in both functions:

// function 1
{
npy_float64 V13_tmp1;
// i0 * i1
V13_tmp1 = V11_i * V9_i;

npy_float64 V13_tmp2;
// exp(i0 * i1)
V13_tmp2 = exp((npy_float64)V13_tmp1);

// First output
// i2 * exp(i0 * i1)
V1_i = V7_i * V13_tmp2;

// Second output
// i3 + exp(i0 * i1)
V3_i = V5_i + V13_tmp2;
}
// function 2
{
npy_float64 V11_tmp1;
// i0 - i1
V11_tmp1 = V9_i - V7_i;

// First output
// - (i0 - i1)
V1_i = -V11_tmp1;

npy_float64 V11_tmp2;
// sqr(i0 - i1)
V11_tmp2 = V11_tmp1 * V11_tmp1;

// Second output
// i2 * sqr(i0 - i1)
V3_i = V5_i * V11_tmp2;
}

You can see that it avoids recomputing the same sub-expressions in both cases. It does store more intermediate variables than strictly needed, but the C compiler should be able to eliminate those.

This would be further improved by inplacing the scalars, as is being pursued in #107.

TODO:

New features

  • Fuse Elemwise graphs that have multiple outputs and non-fuseable clients

Bugfixes

  • Fix bug when creating Composite with multiple identical outputs

@ricardoV94 ricardoV94 added the bug and enhancement labels Dec 13, 2022
@ricardoV94 ricardoV94 force-pushed the multiple_output_fusion branch 3 times, most recently from 50471b6 to af76f79 Compare December 13, 2022 16:16
@ricardoV94
Member Author

@OriolAbril Any idea why readthedocs is failing?

@OriolAbril
Member

I suspect GitHub is laggy or partially down. The error was that readthedocs was unable to find the reference to this PR. I restarted the job and now it was able to fetch the PR changes. 🤷🏿

@ricardoV94 ricardoV94 force-pushed the multiple_output_fusion branch 2 times, most recently from 66dc5e8 to d074db1 Compare December 13, 2022 19:24
@ricardoV94 ricardoV94 force-pushed the multiple_output_fusion branch 4 times, most recently from 3c175ad to 91fb488 Compare December 14, 2022 09:45
@ricardoV94
Member Author

ricardoV94 commented Dec 14, 2022

Yup, the slow and skip marks seem to be ignored in the CI: #5 (comment)

@ricardoV94 ricardoV94 force-pushed the multiple_output_fusion branch 2 times, most recently from 157ad77 to 4931a2d Compare December 14, 2022 10:55
@ricardoV94 ricardoV94 marked this pull request as draft December 14, 2022 12:44
@ricardoV94 ricardoV94 force-pushed the multiple_output_fusion branch 2 times, most recently from 234bb73 to 961e9d4 Compare December 14, 2022 14:39
@ricardoV94
Member Author

Alright, first time the tests pass (minus the ones I disabled :D)

@ricardoV94
Member Author

ricardoV94 commented Dec 15, 2022

Benchmark: running this snippet on this PR vs. main:

import pytensor
import pytensor.tensor as pt
import numpy as np

rng = np.random.default_rng(123)
size = 100_000
x = pytensor.shared(rng.normal(size=size), name="x")
mu = pytensor.shared(rng.normal(size=size), name="mu")

logp = -((x - mu) ** 2) / 2
grad = pt.grad(logp.sum(), x)

func = pytensor.function([], [logp, grad])
pytensor.dprint(func)
%timeit -n 1000 func()

This PR: 145 µs ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Main: 220 µs ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Exact numbers change quite a bit when I rerun the script, but the ranges don't overlap. The difference grows with size and becomes much less noticeable (or vanishes) for smaller sizes.

I would appreciate it if someone else could replicate the benchmark.
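For replication outside IPython, a plain `timeit` variant of the same measurement (assuming the `func` from the snippet above):

```python
import timeit

# 7 runs of 1,000 calls each, mirroring the %timeit settings above.
times = timeit.repeat(func, repeat=7, number=1_000)
per_call_us = [t / 1_000 * 1e6 for t in times]
print(f"best: {min(per_call_us):.1f} µs per call")
```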

@aseyboldt
Member

Timings on my desktop machine give a similar pattern.
[benchmark timing plot]

@ricardoV94 ricardoV94 changed the title Fuse Elemwise graphs that have multiple outputs and non-fuseable clients Fuse Elemwise graphs that have multiple outputs and clients Dec 16, 2022
@ricardoV94 ricardoV94 force-pushed the multiple_output_fusion branch 3 times, most recently from 3f9b197 to e889d2c Compare December 16, 2022 13:01
@ricardoV94 ricardoV94 force-pushed the multiple_output_fusion branch 4 times, most recently from 53e3a67 to ceb53da Compare February 7, 2023 19:35
                new_node = ret[0].owner
            else:
                break

    def apply(self, fgraph):
Member

The main loop looks much better, but the local functions still make it really hard to read, as it is ~450 lines in total.
Would it be feasible to extract these local functions and pass a few more things as args/kwargs if needed?

Your call though

("float32", "float32"),
),
marks=pytest.mark.xfail, # Not implemented yet
),
Member

I don't want to sound nitpicky, but you're adding more lines to the parametrize (74) than the test method has to begin with (61).

I mean sure, there were 600 lines of parametrize before, but maybe some of these parametrize items deserve their own test method.

Member Author

@ricardoV94 ricardoV94 Feb 7, 2023

This is a rewrite that affects 99.9% of all graphs in a very fundamental manner, so I don't think it's crazy that we would test so many parametrizations.

Besides autodiff, loop fusion is basically the other thing that PyTensor does that goes beyond vanilla NumPy.

Member

I fully support having all these tests; I'm just questioning whether this is the most maintainable way to organize the test code.
For example, the test implementation could be extracted into a function, and some of the most involved parametrizations that have ~10 levels of indentation could just be their own test methods, as in the sketch below.
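For illustration, a minimal sketch of the extraction being suggested, with hypothetical names throughout:

```python
import pytest

# Hypothetical structure, not the actual test suite.
FUSION_CASES = [
    # (graph_outputs, inputs, input_values, expected_fused_ops, expected_results)
]

def check_fusion_case(outputs, inputs, input_values, n_fused, expected):
    """Shared test body: compile the graph, assert on fusion and outputs."""
    ...

@pytest.mark.parametrize("case", FUSION_CASES)
def test_elemwise_fusion(case):
    check_fusion_case(*case)

def test_elemwise_fusion_assert_op():
    # The deeply nested assert/log/ge parametrization, promoted to its own test.
    ...
```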

Member Author

Hmm... We could also use a generator to yield the parametrizations, but I am not sure that's much better. I went through the parametrizations and there is only one that seems too long; I rearranged it to have less nesting:

            (
                (
                    log(
                        ge(
                            assert_op(
                                at_abs(fx),
                                at_all(ge(at_abs(fx), 0)),
                            ),
                            0,
                        )
                    ),
                ),
                (fx,),
                (fxv,),
                4,
                (np.zeros_like(fxv),),
                ("float32",),
            ),

* Move local_add_mul_fusion to `rewriting/elemwise` and remove unused/duplicated TestAddMulFusion tests
* Use EquilibriumGraphRewriter for local_add_mul_fusion (see the sketch after this list)
* Do not register optional rewrites if tensor__local_elemwise_fusion flag is disabled
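As a side note, a hedged sketch of what wiring `local_add_mul_fusion` into an `EquilibriumGraphRewriter` can look like, following the commit messages above (the `max_use_ratio` value is an arbitrary placeholder):

```python
from pytensor.graph.rewriting.basic import EquilibriumGraphRewriter
from pytensor.tensor.rewriting.elemwise import local_add_mul_fusion

# Apply the add/mul fusion rewrite repeatedly until the graph
# reaches a fixed point.
add_mul_fusion_rewriter = EquilibriumGraphRewriter(
    [local_add_mul_fusion], max_use_ratio=10
)
```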
@ricardoV94
Member Author

Ready for re-review

@michaelosthege michaelosthege left a comment

Thanks for cleaning up the code some more, @ricardoV94.

I know that you put a lot of attention into the details here. If there is anything where you're in doubt and need a careful review, let me know. It's just too much for one pass.

@ricardoV94
Member Author

> Thanks for cleaning up the code some more, @ricardoV94.
>
> I know that you put a lot of attention into the details here. If there is anything where you're in doubt and need a careful review, let me know. It's just too much for one pass.

Yeah, it's a tough one to review. Thanks for the suggestions to make it a tiny bit more readable. I think I will merge it and iterate in future PRs.

@ricardoV94
Member Author

Actually, I still have to clean up the comments in one function... BRB

This was not an issue on my local machine, but it failed on the GitHub CI. It could be due to compiler optimizations. Case 69 used to look like this:

```python
Elemwise{Composite{(i0 * tan(i0) * tan(i0) * i1)}} [id C]
 |x [id A]
 |x [id A]
```

And now looks like this:

```python
Elemwise{Composite{(i0 * tan(i0) * tan(i0) * i0)}} [id C]
 |x [id A] [None]
```
It doesn't make sense to include `fast_run` rewrites if `fast_compile` mode is being used. Some rewrites, such as the FusionOptimizer, are not compatible with `fast_compile` mode, which prevents the creation of C thunks. The FusionOptimizer has no way of knowing this is the case and assumes it is safe to return Composites with more than 32 operands, even though that's not the case with the Python perform method.
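For context, fusion can also be excluded explicitly when compiling; a minimal sketch, assuming the conventional "fusion" rewrite tag (not something this PR adds):

```python
import pytensor
import pytensor.tensor as pt
from pytensor.compile.mode import get_default_mode

x = pt.vector("x")
out = pt.exp(x / 5) * 2

# Exclude all rewrites tagged "fusion" (FusionOptimizer included),
# e.g. when compiling without C thunks.
mode = get_default_mode().excluding("fusion")
f = pytensor.function([x], out, mode=mode)
```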
@ricardoV94 ricardoV94 merged commit 5521d82 into pymc-devs:main Feb 8, 2023
@ricardoV94 ricardoV94 deleted the multiple_output_fusion branch June 21, 2023 08:56