Fuse Elemwise graphs that have multiple outputs and clients #121
@OriolAbril Any idea why readthedocs is failing?

I suspect GitHub is laggy or partially down. The error was that readthedocs was unable to find the reference to this PR. I restarted the job and now it has been able to get the PR changes. 🤷🏿
Yup, the slow and skip marks seem to be ignored in the CI #5 (comment)
Alright, first time the tests pass (minus the ones I disabled :D)
Benchmark, running this snippet after this PR vs main:

```python
import pytensor
import pytensor.tensor as pt
import numpy as np

rng = np.random.default_rng(123)
size = 100_000
x = pytensor.shared(rng.normal(size=size), name="x")
mu = pytensor.shared(rng.normal(size=size), name="mu")
logp = -((x - mu) ** 2) / 2
grad = pt.grad(logp.sum(), x)
func = pytensor.function([], [logp, grad])
pytensor.dprint(func)
%timeit -n 1000 func()
```

This PR: 145 µs ± 15.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each). Exact numbers change quite a bit when I rerun the script, but they don't overlap. The difference grows with size, and becomes way less noticeable (or vanishes) for smaller sizes. Would appreciate it if someone else could replicate the benchmark.
```python
        new_node = ret[0].owner
    else:
        break

def apply(self, fgraph):
```
The main loop looks much better, but the local functions still make it really hard to read, as the whole thing is ~450 lines in total.
Would it be feasible to extract these local functions and pass a few more things as args/kwargs if needed?
Your call though.
```python
        ("float32", "float32"),
    ),
    marks=pytest.mark.xfail,  # Not implemented yet
),
```
I don't want to sound nitpicky, but you're adding more lines to the `parametrize` (74) than the test method has to begin with (61). I mean sure, there were 600 lines of `parametrize` before, but maybe some of these `parametrize` items deserve their own test method.
This is a rewrite that affects 99.9% of all graphs in a very fundamental manner; I don't think it's crazy that we would test so many parametrizations.
Besides autodiff, loop fusion is basically the other thing that PyTensor does that goes beyond vanilla NumPy.
I fully support having all these tests. I'm just questioning whether this is the most maintainable way to organize the test code.
For example, the test implementation could be extracted into a function, and some of the most involved parametrizations that have these 10ish levels of indentation could just be their own test methods.
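As a sketch of that suggestion (all names here are invented for illustration, not taken from this PR): the shared body of the parametrized test can live in a plain helper function, and an especially involved case can become its own named test that calls the same helper.

```python
import numpy as np

# Hypothetical helper: the shared body of the parametrized fusion test,
# extracted into a plain function that any test method can call.
def check_fusion_case(compute, inputs, expected, rtol=1e-6):
    result = compute(*inputs)
    assert np.allclose(result, expected, rtol=rtol)

# Simple cases can stay in the large parametrize list and call the helper...
def test_mul_fusion_simple():
    check_fusion_case(lambda a, b: a * b, (np.full(3, 2.0), np.full(3, 3.0)), np.full(3, 6.0))

# ...while a deeply nested case becomes its own named test method.
def test_fusion_log_ge_abs():
    compute = lambda a: np.log(np.abs(a) >= 0)
    check_fusion_case(compute, (np.zeros(4),), np.zeros(4))
```
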
Hmm... We can also use a generator to yield the parametrizations, but I am not sure that's much better? I went through the parametrizations and there is only one that seems too long. I rearranged it to have less nesting:
```python
(
    (
        log(
            ge(
                assert_op(
                    at_abs(fx),
                    at_all(ge(at_abs(fx), 0)),
                ),
                0,
            )
        ),
    ),
    (fx,),
    (fxv,),
    4,
    (np.zeros_like(fxv),),
    ("float32",),
),
```
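For reference, the generator idea could look like the following sketch (the case contents below are made up for illustration, not taken from the actual test suite):

```python
# Hypothetical sketch: yield parametrize cases from a generator so related
# cases can be grouped and commented where they are built.
def fusion_cases():
    # elementary binary ops share one template
    for op_name in ("add", "mul", "sub"):
        yield (op_name, 2, ("float32", "float32"))
    # a more involved case, kept readable with local variables
    dtypes = ("float32",)
    yield ("log_ge_abs", 1, dtypes)

cases = list(fusion_cases())
```

pytest would then consume it as `@pytest.mark.parametrize("op_name, n_inputs, dtypes", list(fusion_cases()))`, though whether that is more readable than a literal list is debatable.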
* Move local_add_mul_fusion to `rewriting/elemwise` and remove unused/duplicated TestAddMulFusion tests
* Use EquilibriumGraphRewriter for local_add_mul_fusion
* Do not register optional rewrites if tensor__local_elemwise_fusion flag is disabled
Ready for re-review
thanks for cleaning up the code some more, @ricardoV94
I know that you invested a lot of attention on the details here. If there is anything where you're in doubt and need a careful review, let me know. It's just too much for one pass.

Yeah, it's a tough one to review. Thanks for the suggestions to make it a tiny bit more readable. I think I will merge it and iterate in future PRs.

Actually, I still have to clean the comments from one function... BRB
This was not an issue on my local machine, but it failed on the GitHub CI. It could be due to compiler optimizations. Case 69 used to look like this:

```python
Elemwise{Composite{(i0 * tan(i0) * tan(i0) * i1)}} [id C]
 |x [id A]
 |x [id A]
```

And now looks like this:

```python
Elemwise{Composite{(i0 * tan(i0) * tan(i0) * i0)}} [id C]
 |x [id A] [None]
```
It doesn't make sense to include `fast_run` if `fast_compile` mode is being used. Some rewrites, such as the FusionOptimizer, are not compatible with `fast_compile` mode, which prevents the creation of C thunks. The FusionOptimizer has no way of knowing this is the case, and assumes it is safe to return Composites with more than 32 operands, even though that's not the case with the Python perform method.
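A sketch of the kind of guard described above (the names here are hypothetical, not PyTensor's actual API): the operand cap for a fused Composite depends on whether C thunks will be generated.

```python
# Hypothetical sketch, not PyTensor's actual API: pick the maximum number
# of operands a fused Composite may take depending on the backend.
PYTHON_PERFORM_MAX_OPERANDS = 32  # assumed limit of the Python perform method

def max_fusion_operands(using_c_thunks: bool) -> float:
    # With C thunks the Composite can exceed 32 operands; the Python
    # perform method cannot, so fusion must stop at the limit.
    return float("inf") if using_c_thunks else PYTHON_PERFORM_MAX_OPERANDS
```
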
Continuation of aesara-devs/aesara#1242, as I've been blocked on that repo
More context can be found in aesara-devs/aesara#1237
To clarify the idea behind allowing Composites with multiple clients, consider the following example:
Before this PR it produced this graph:
After:
The expected savings come from iterating only once over the data vector x, instead of three times before: once over x and twice over `exp(i0 * i1)`. Another example is the following graph:
Before:
After:
Again, we replace 3 loops with 1.
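In plain NumPy terms, the saving can be sketched like this (a conceptual illustration, not PyTensor code): unfused, each output traverses the data separately and recomputes the shared subexpression; fused, a single pass computes both outputs and evaluates the shared subexpression once per element.

```python
import numpy as np

def unfused(x, mu):
    # separate elementwise traversals, one per output,
    # each recomputing (x - mu)
    logp = -((x - mu) ** 2) / 2
    grad = -(x - mu)  # derivative of logp.sum() with respect to x
    return logp, grad

def fused(x, mu):
    # a single loop computes both outputs, evaluating (x - mu) once
    logp = np.empty_like(x)
    grad = np.empty_like(x)
    for i in range(x.size):
        d = x[i] - mu[i]
        logp[i] = -(d * d) / 2
        grad[i] = -d
    return logp, grad
```
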
For more detail, here is the C code of the Composite scalar in the first function:
You can see that it avoids recomputing the same sub-expressions in both cases. It does store more intermediate variables than needed, but hopefully the compiler will be able to get rid of those.
This would be further improved by inplacing the scalars as is being pursued in #107
TODO:

- `add_mul_fusion`
- `elemwise_max_input_fct`