Fuse consecutive `Elemwise` subgraphs with multiple clients #1242

ricardoV94 · 2022-10-06T16:00:39Z

Todo

ricardoV94 · 2022-10-06T19:37:58Z

The TestFusion.test_big_fusion seems to run considerably slower, suggesting my first attempt may be too expensive for large graphs.

Before:

After:

Edit: With the latest changes it is now more reasonable:

brandonwillard · 2022-10-08T00:14:58Z

The TestFusion.test_big_fusion seems to run considerably slower, suggesting my first attempt may be too expensive for large graphs.

It's hard to tell what's going on using those numbers alone. For example, the extra time could be spent in compilation, and the run-time could be significantly reduced. Regardless, the difference is alarming.

Situations like this are another reason we should get #718 in place sooner than later.

ricardoV94 · 2022-10-10T07:51:00Z

The logic for inplacing will have to be rethought, as some inplaced outputs could overwrite inputs that are still needed for other outputs. Basically we will need something that reasons about the inner graph like we do for the general function.

Edit: For now I just restricted inplace to single-output Composites

ricardoV94 · 2022-10-11T11:26:21Z

Another more interesting issue I am finding is some FunctionGraph.replace_all_validate that are probably leading to cyclical dependencies (optimization just hangs as does dprinting the graph). This happens with tensor.test_basic.test_tile.

Edit: It was a bug in the subgraph algorithm. Fixed!

The same job is done by canonicalize before this rewrite is ever called.

ricardoV94 · 2022-10-18T12:49:03Z

This seems to be now working (more often than not) on the C-backend. It provides less speedups than I was expecting:

import aesara
import aesara.tensor as at
import numpy as np

x = at.dvector("x")
mu = at.dvector("mu")
logp = (- ((x - mu) **2) / 2)
grad = at.grad(logp.sum(), x)
func = aesara.function([mu, x], [logp, grad])
func.trust_input = True
aesara.dprint(func)

rng = np.random.default_rng(123)
size = 100_000
xv = rng.normal(size=size)
muv = rng.normal(size=size)

%timeit func(xv, muv)

The speedup depends on the size.

size=1,000: the old method is 1.11x slower
size=10,000: the old method is 1.18x slower
size=100,000: the old method is 1.34x slower
size=1,000,000: the old method is 1.45x slower

I couldn't test the effects on the Numba backend, because mulit-output Elemwises are disabled (we could test https://numba.pydata.org/numba-doc/latest/user/vectorize.html#the-guvectorize-decorator).

The JAX backend also errors out but I didn't investigate why yet.

@brandonwillard do you know of an easy way to retrieve the c_code generated by the whole function? Would help to see what it is trying to do inside the Elemwise

brandonwillard · 2022-10-18T17:41:27Z

@brandonwillard do you know of an easy way to retrieve the c_code generated by the whole function? Would help to see what it is trying to do inside the Elemwise

Which function exactly? All the C code generated during an aesara.function call? I believe one can get the cache/compiled module paths from the _CThunk objects in the Function returned by aesara.function, and the C source files are in those directories.

brandonwillard · 2022-10-18T17:50:04Z

This seems to be now working (more often than not) on the C-backend. It provides less speedups than I was expecting:

It's possible that this new feature has to sometimes trade off between the benefits of "merging"/CSE and fusion. Your example in #1237 illustrates this possibility with the exp node; that's why we should first clarify how we expect fusion to work in these scenarios (#1237 (comment)).

ricardoV94 · 2022-10-19T08:07:55Z

@brandonwillard I extended the motivation behind this PR in the original issue: #1237 (comment)

Otherwise they fail due to lack of support for multi-output Elemwises in the Numba backend

ricardoV94 force-pushed the multiple_output_composite branch 2 times, most recently from 074f125 to fff0d3c Compare October 6, 2022 17:41

ricardoV94 changed the title ~~Fuse consecutive Elemwise subgraphs with multiple outputs~~ Fuse consecutive Elemwise subgraphs with multiple clients Oct 6, 2022

ricardoV94 force-pushed the multiple_output_composite branch 3 times, most recently from c84e827 to fb6abe0 Compare October 6, 2022 19:16

brandonwillard added graph rewriting performance concern labels Oct 6, 2022

ricardoV94 force-pushed the multiple_output_composite branch from fb6abe0 to cbd2c61 Compare October 6, 2022 19:25

ricardoV94 force-pushed the multiple_output_composite branch 6 times, most recently from d4301b1 to 0246fce Compare October 8, 2022 00:11

ricardoV94 force-pushed the multiple_output_composite branch 2 times, most recently from 76b30b4 to f000323 Compare October 11, 2022 16:12

Ricardo Vieira added 3 commits October 11, 2022 18:12

Do not create fuse_seqopt optionally

4d92113

Remove redundant add_mul_fusion

cd04893

The same job is done by canonicalize before this rewrite is ever called.

Add test for nested broadcasted Composite graphs

e2d80eb

ricardoV94 force-pushed the multiple_output_composite branch from f000323 to 618c11c Compare October 11, 2022 16:12

brandonwillard added the important label Oct 11, 2022

brandonwillard changed the title ~~Fuse consecutive Elemwise subgraphs with multiple clients~~ Fuse consecutive Elemwise subgraphs with multiple clients Oct 11, 2022

ricardoV94 force-pushed the multiple_output_composite branch 3 times, most recently from 762061b to e70ea5f Compare October 18, 2022 08:32

ricardoV94 force-pushed the multiple_output_composite branch 3 times, most recently from 233f68b to 1bfbf24 Compare October 18, 2022 12:42

ricardoV94 force-pushed the multiple_output_composite branch 3 times, most recently from ddc83a4 to e00c125 Compare October 18, 2022 15:02

ricardoV94 force-pushed the multiple_output_composite branch from e00c125 to e518d4b Compare October 19, 2022 08:40

Ricardo Vieira added 4 commits October 27, 2022 20:29

Fix bug in Composite when multiple outputs are identical

cc409b8

Temporarily disable inplace for multiple-output Composites

fe9c347

Fuse consecutive Elemwise subgraphs with multiple clients

d981df8

Temporarily exclude fusion rewrite from Numba Scan tests

1302b49

Otherwise they fail due to lack of support for multi-output Elemwises in the Numba backend

ricardoV94 force-pushed the multiple_output_composite branch from e078f4e to 1302b49 Compare October 27, 2022 18:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fuse consecutive `Elemwise` subgraphs with multiple clients #1242

Fuse consecutive `Elemwise` subgraphs with multiple clients #1242

ricardoV94 commented Oct 6, 2022 •

edited

Loading

ricardoV94 commented Oct 6, 2022 •

edited

Loading

brandonwillard commented Oct 8, 2022 •

edited

Loading

ricardoV94 commented Oct 10, 2022 •

edited

Loading

ricardoV94 commented Oct 11, 2022 •

edited

Loading

ricardoV94 commented Oct 18, 2022 •

edited

Loading

brandonwillard commented Oct 18, 2022

brandonwillard commented Oct 18, 2022

ricardoV94 commented Oct 19, 2022

Fuse consecutive Elemwise subgraphs with multiple clients #1242

Are you sure you want to change the base?

Fuse consecutive Elemwise subgraphs with multiple clients #1242

Conversation

ricardoV94 commented Oct 6, 2022 • edited Loading

Todo

ricardoV94 commented Oct 6, 2022 • edited Loading

brandonwillard commented Oct 8, 2022 • edited Loading

ricardoV94 commented Oct 10, 2022 • edited Loading

ricardoV94 commented Oct 11, 2022 • edited Loading

ricardoV94 commented Oct 18, 2022 • edited Loading

brandonwillard commented Oct 18, 2022

brandonwillard commented Oct 18, 2022

ricardoV94 commented Oct 19, 2022

Fuse consecutive `Elemwise` subgraphs with multiple clients #1242

Fuse consecutive `Elemwise` subgraphs with multiple clients #1242

ricardoV94 commented Oct 6, 2022 •

edited

Loading

ricardoV94 commented Oct 6, 2022 •

edited

Loading

brandonwillard commented Oct 8, 2022 •

edited

Loading

ricardoV94 commented Oct 10, 2022 •

edited

Loading

ricardoV94 commented Oct 11, 2022 •

edited

Loading

ricardoV94 commented Oct 18, 2022 •

edited

Loading