
Fuse Elemwise graphs that have multiple outputs and clients #121

Merged

ricardoV94 merged 13 commits into pymc-devs:main from multiple_output_fusion on Feb 8, 2023

Conversation


@ricardoV94 ricardoV94 commented Dec 13, 2022

Continuation of aesara-devs/aesara#1242, as I've been blocked on that repo.

More context can be found in aesara-devs/aesara#1237

To clarify the idea behind allowing Composites with multiple clients, consider the following example:

import pytensor as aesara
import pytensor.tensor as at

x = at.vector("x")
y = at.exp(x/5)
w = y + 1
z = y * 2

f1 = aesara.function([x], [z, w])
aesara.dprint(f1)

Before this PR it produced this graph:

Elemwise{Mul}[(0, 1)] [id A] 2
 |TensorConstant{(1,) of 2.0} [id B]
 |Elemwise{Composite{exp((i0 * i1))}} [id C] 0
   |TensorConstant{(1,) of 0.2} [id D]
   |x [id E]
Elemwise{add,no_inplace} [id F] 1
 |TensorConstant{(1,) of 1.0} [id G]
 |Elemwise{Composite{exp((i0 * i1))}} [id C] 0

After:

Elemwise{Composite{(i2 * exp((i0 * i1))), (i3 + exp((i0 * i1)))}}.0 [id A] 0
 |TensorConstant{(1,) of 0.2} [id B]
 |x [id C]
 |TensorConstant{(1,) of 2.0} [id D]
 |TensorConstant{(1,) of 1.0} [id E]
Elemwise{Composite{(i2 * exp((i0 * i1))), (i3 + exp((i0 * i1)))}}.1 [id A] 0

The expected savings come from iterating over the data vector x only once, versus three times before: once over x and twice over exp(i0 * i1).
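To make the savings concrete, here is a rough Python sketch of the loop structure before and after fusion (illustration only, not PyTensor's actual generated code):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=10)

# Before: three separate element-wise passes over the data.
y = np.empty_like(x)
for i in range(x.size):  # loop 1: y = exp(x / 5)
    y[i] = np.exp(0.2 * x[i])
w = np.empty_like(x)
for i in range(x.size):  # loop 2: w = y + 1
    w[i] = y[i] + 1.0
z = np.empty_like(x)
for i in range(x.size):  # loop 3: z = y * 2
    z[i] = y[i] * 2.0

# After: one fused pass computes both outputs per element.
w2, z2 = np.empty_like(x), np.empty_like(x)
for i in range(x.size):
    tmp = np.exp(0.2 * x[i])  # shared sub-expression, computed once
    z2[i] = 2.0 * tmp
    w2[i] = tmp + 1.0

assert np.allclose(w, w2) and np.allclose(z, z2)
```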

Another example is the following graph:

x = at.dvector("x")
mu = at.dvector("mu")
logp = -((x - mu) ** 2) / 2
grad = at.grad(logp.sum(), x)
f2 = aesara.function([mu, x], [logp, grad])
aesara.dprint(f2)

Before:

Elemwise{Composite{(i0 * sqr(i1))}}[(0, 1)] [id A] 2
 |TensorConstant{(1,) of -0.5} [id B]
 |Elemwise{sub,no_inplace} [id C] 0
   |x [id D]
   |mu [id E]
Elemwise{neg,no_inplace} [id F] 1
 |Elemwise{sub,no_inplace} [id C] 0

After:

Elemwise{Composite{(-(i0 - i1)), (i2 * sqr((i0 - i1)))}}.1 [id A] 0
 |x [id B]
 |mu [id C]
 |TensorConstant{(1,) of -0.5} [id D]
Elemwise{Composite{(-(i0 - i1)), (i2 * sqr((i0 - i1)))}}.0 [id A] 0

Again, we replace three loops with one.

For more detail, here is the C code of the Composite scalars in both functions:

// function 1
{
npy_float64 V13_tmp1;
// i0 * i1
V13_tmp1 = V11_i * V9_i;

npy_float64 V13_tmp2;
// exp(i0 * i1)
V13_tmp2 = exp((npy_float64)V13_tmp1);

// First output
// i2 * exp(i0 * i1)
V1_i = V7_i * V13_tmp2;

// Second output
// i3 + exp(i0 * i1)
V3_i = V5_i + V13_tmp2;
}
// function 2
{
npy_float64 V11_tmp1;
// i0 - i1
V11_tmp1 = V9_i - V7_i;

// First output
// - (i0 - i1)
V1_i = -V11_tmp1;

npy_float64 V11_tmp2;
// sqr(i0 - i1)
V11_tmp2 = V11_tmp1 * V11_tmp1;

// Second output
// i2 * sqr(i0 - i1)
V3_i = V5_i * V11_tmp2;
}

You can see that it avoids recomputing the same sub-expressions in both cases. It does store more intermediate variables than strictly needed, but the C compiler should be able to eliminate those.

This would be further improved by inplacing the scalars, as is being pursued in #107.

TODO:

New features

  • Fuse Elemwise graphs that have multiple outputs and non-fuseable clients

Bugfixes

  • Fix bug when creating Composite with multiple identical outputs

@ricardoV94 ricardoV94 added the bug and enhancement labels Dec 13, 2022
@ricardoV94 ricardoV94 force-pushed the multiple_output_fusion branch 3 times, most recently from 50471b6 to af76f79 Compare December 13, 2022 16:16
@ricardoV94
Member Author

@OriolAbril Any idea why readthedocs is failing?

@OriolAbril
Member

I suspect GitHub is laggy or partially down. The error was that readthedocs was unable to find the reference to this PR. I restarted the job and now it was able to fetch the PR changes. 🤷🏿

@ricardoV94 ricardoV94 force-pushed the multiple_output_fusion branch 2 times, most recently from 66dc5e8 to d074db1 Compare December 13, 2022 19:24
@ricardoV94 ricardoV94 force-pushed the multiple_output_fusion branch 4 times, most recently from 3c175ad to 91fb488 Compare December 14, 2022 09:45
@ricardoV94
Member Author

ricardoV94 commented Dec 14, 2022

Yup, the slow and skip marks seem to be ignored in the CI: #5 (comment)

@ricardoV94 ricardoV94 force-pushed the multiple_output_fusion branch 2 times, most recently from 157ad77 to 4931a2d Compare December 14, 2022 10:55
@ricardoV94 ricardoV94 marked this pull request as draft December 14, 2022 12:44
@ricardoV94 ricardoV94 force-pushed the multiple_output_fusion branch 2 times, most recently from 234bb73 to 961e9d4 Compare December 14, 2022 14:39
@ricardoV94
Member Author

Alright, first time the tests pass (minus the ones I disabled :D)

@ricardoV94
Member Author

ricardoV94 commented Dec 15, 2022

Benchmark: running this snippet on this PR vs. main:

import pytensor
import pytensor.tensor as pt
import numpy as np

rng = np.random.default_rng(123)
size = 100_000
x = pytensor.shared(rng.normal(size=size), name="x")
mu = pytensor.shared(rng.normal(size=size), name="mu")

logp = -((x - mu) ** 2) / 2
grad = pt.grad(logp.sum(), x)

func = pytensor.function([], [logp, grad])
pytensor.dprint(func)
%timeit -n 1000 func()

This PR: 145 µs ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Main: 220 µs ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Exact numbers change quite a bit when I rerun the script, but the ranges don't overlap. The difference grows with size and becomes much less noticeable (or vanishes) for smaller sizes.

I would appreciate it if someone else could replicate the benchmark.
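For replication outside IPython, a plain `timeit` variant of the same measurement (assuming the `func` from the snippet above):

```python
import timeit

# 7 runs of 1,000 calls each, mirroring the %timeit settings above.
times = timeit.repeat(func, repeat=7, number=1_000)
per_call_us = [t / 1_000 * 1e6 for t in times]
print(f"best: {min(per_call_us):.1f} µs per call")
```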

@aseyboldt
Member

Timings on my desktop machine give a similar pattern.
[benchmark timing plot]

@ricardoV94 ricardoV94 changed the title Fuse Elemwise graphs that have multiple outputs and non-fuseable clients Fuse Elemwise graphs that have multiple outputs and clients Dec 16, 2022
@ricardoV94 ricardoV94 force-pushed the multiple_output_fusion branch 3 times, most recently from 3f9b197 to e889d2c Compare December 16, 2022 13:01
@ricardoV94 ricardoV94 force-pushed the multiple_output_fusion branch 4 times, most recently from 53e3a67 to ceb53da Compare February 7, 2023 19:35
                new_node = ret[0].owner
            else:
                break

    def apply(self, fgraph):
Member

The main loop looks much better, but the local functions still make it really hard to read, as it is ~450 lines in total.
Would it be feasible to extract these local functions and pass a few more things as args/kwargs if needed?

Your call though

("float32", "float32"),
),
marks=pytest.mark.xfail, # Not implemented yet
),
Member

I don't want to sound nitpicky, but you're adding more lines to the parametrize (74) than the test method has to begin with (61).

I mean sure, there were 600 lines of parametrize before, but maybe some of these parametrize items deserve their own test method.

Member Author

@ricardoV94 ricardoV94 Feb 7, 2023

This is a rewrite that affects 99.9% of all graphs in a very fundamental manner, so I don't think it's crazy that we would test so many parametrizations.

Besides autodiff, loop fusion is basically the other thing that PyTensor does that goes beyond vanilla NumPy.

Member

I fully support having all these tests; I'm just questioning whether this is the most maintainable way to organize the test code.
For example, the test implementation could be extracted into a function, and some of the most involved parametrizations that have ~10 levels of indentation could just be their own test methods, as in the sketch below.
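For illustration, a minimal sketch of the extraction being suggested, with hypothetical names throughout:

```python
import pytest

# Hypothetical structure, not the actual test suite.
FUSION_CASES = [
    # (graph_outputs, inputs, input_values, expected_fused_ops, expected_results)
]

def check_fusion_case(outputs, inputs, input_values, n_fused, expected):
    """Shared test body: compile the graph, assert on fusion and outputs."""
    ...

@pytest.mark.parametrize("case", FUSION_CASES)
def test_elemwise_fusion(case):
    check_fusion_case(*case)

def test_elemwise_fusion_assert_op():
    # The deeply nested assert/log/ge parametrization, promoted to its own test.
    ...
```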

Member Author

Hmm... We could also use a generator to yield the parametrizations, but I am not sure that's much better. I went through the parametrizations and there is only one that seems too long; I rearranged it to have less nesting:

            (
                (
                    log(
                        ge(
                            assert_op(
                                at_abs(fx),
                                at_all(ge(at_abs(fx), 0)),
                            ),
                            0,
                        )
                    ),
                ),
                (fx,),
                (fxv,),
                4,
                (np.zeros_like(fxv),),
                ("float32",),
            ),

* Move local_add_mul_fusion to `rewriting/elemwise` and remove unused/duplicated TestAddMulFusion tests
* Use EquilibriumGraphRewriter for local_add_mul_fusion (see the sketch after this list)
* Do not register optional rewrites if tensor__local_elemwise_fusion flag is disabled
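As a side note, a hedged sketch of what wiring `local_add_mul_fusion` into an `EquilibriumGraphRewriter` can look like, following the commit messages above (the `max_use_ratio` value is an arbitrary placeholder):

```python
from pytensor.graph.rewriting.basic import EquilibriumGraphRewriter
from pytensor.tensor.rewriting.elemwise import local_add_mul_fusion

# Apply the add/mul fusion rewrite repeatedly until the graph
# reaches a fixed point.
add_mul_fusion_rewriter = EquilibriumGraphRewriter(
    [local_add_mul_fusion], max_use_ratio=10
)
```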
@ricardoV94
Member Author

Ready for re-review

@michaelosthege michaelosthege left a comment

Thanks for cleaning up the code some more, @ricardoV94.

I know that you put a lot of attention into the details here. If there is anything where you're in doubt and need a careful review, let me know. It's just too much for one pass.

@ricardoV94
Member Author

> Thanks for cleaning up the code some more, @ricardoV94.
>
> I know that you put a lot of attention into the details here. If there is anything where you're in doubt and need a careful review, let me know. It's just too much for one pass.

Yeah, it's a tough one to review. Thanks for the suggestions to make it a tiny bit more readable. I think I will merge it and iterate in future PRs.

@ricardoV94
Member Author

Actually, I still have to clean up the comments in one function... BRB

This was not an issue on my local machine, but it failed on the GitHub CI. It could be due to compiler optimizations. Case 69 used to look like this:

```python
Elemwise{Composite{(i0 * tan(i0) * tan(i0) * i1)}} [id C]
 |x [id A]
 |x [id A]
```

And now looks like this:

```python
Elemwise{Composite{(i0 * tan(i0) * tan(i0) * i0)}} [id C]
 |x [id A] [None]
```
It doesn't make sense to include `fast_run` rewrites if `fast_compile` mode is being used. Some rewrites, such as the FusionOptimizer, are not compatible with `fast_compile` mode, which prevents the creation of C thunks. The FusionOptimizer has no way of knowing this is the case and assumes it is safe to return Composites with more than 32 operands, even though that's not the case with the Python perform method.
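For context, fusion can also be excluded explicitly when compiling; a minimal sketch, assuming the conventional "fusion" rewrite tag (not something this PR adds):

```python
import pytensor
import pytensor.tensor as pt
from pytensor.compile.mode import get_default_mode

x = pt.vector("x")
out = pt.exp(x / 5) * 2

# Exclude all rewrites tagged "fusion" (FusionOptimizer included),
# e.g. when compiling without C thunks.
mode = get_default_mode().excluding("fusion")
f = pytensor.function([x], out, mode=mode)
```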
@ricardoV94 ricardoV94 merged commit 5521d82 into pymc-devs:main Feb 8, 2023
@ricardoV94 ricardoV94 deleted the multiple_output_fusion branch June 21, 2023 08:56