Excessive LLVM time in large function #44998
The following comment applies to a timing.ll file that Keno generated, with some custom changes, from the Julia code above. He has explicitly stated that the timing.ll file is to be considered public. I tried to attach it here, but I am getting errors on the attempt.
-time-passes overwhelmingly shows the problem is in SLP.
I'm in the process of running perf now, but in the meantime, here are some observations on the IR itself. The IR is a repeating pattern that looks like this:
This is the optimized IR, but the initial IR is fairly similar, just with less-canonicalized addressing. This pattern repeats several hundred (thousand?) times, and it looks like an obvious candidate for loop rerolling. The pattern writes {8 bytes random, 1 byte equal to 1, 7 bytes undef} at every 16-byte stride of a newly allocated array. There are 180 new arrays, each initialized this way. We should be able to turn this into either a loop nest or, at minimum, a sequence of 180 initialization loops. However, the calls to rand have been generated as calls to unique symbols, which prevents the reroll. We could in theory teach the reroller to reroll via indirect calls through an array of constant function pointers, but it would probably be better to explore why we need unique symbols in the first place. Despite this pattern, SLP still shouldn't be taking so much time to do nothing. I'll update once perf report gets around to completing. :)
Ok, here are some half-baked thoughts on what's going on in the SLP code. We appear to start by trying to vectorize the stores of the newly allocated objects into the result buffer (not the initialization of the objects themselves). This is fine, except that we then try to extend the scheduling window all the way back to the allocations, which pulls in every instruction in the basic block. We then spend all of our time calculating dependencies for every instruction in the block. Worse, we repeat this same rebuild of dependencies for each pair of stores to the result buffer. I'm currently a bit unclear on why we need to reset the dependencies, as opposed to just the schedule, when moving between pairs of stores. We clearly do - I can see it happen - but I'm not sure why. The worst part is that the vectorization ends up being unprofitable, so we get no benefit at all from this. :)
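The cost behavior described above can be illustrated with a toy model (the numbers and function below are mine, not measured from LLVM): if the dependency analysis is rebuilt over the whole scheduling region for every candidate store pair, the work is roughly proportional to pairs times region size, which explodes once the region spans the entire block.

```python
# Toy cost model for the behavior described above: each store pair
# triggers a full dependency recomputation over the current
# scheduling region, giving O(store_pairs * region_size) total work.
def dependency_work(store_pairs: int, region_size: int) -> int:
    """Approximate number of dependency scans performed."""
    return store_pairs * region_size

# A tight window keeps the work manageable ...
print(dependency_work(store_pairs=1_000, region_size=16))       # 16000
# ... but a region extended back to the allocations does not:
print(dependency_work(store_pairs=1_000, region_size=100_000))  # 100000000
```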
That pattern is only present in the synthetic test case. For the real case, each of the calls is different (and most of them will be inlined, giving non-regular patterns). That said, I do not know why each of the symbols got uniqued.
Ok, here's a slightly more complete sketch of what's going on in SLP vectorizer.
The more I look at this, the more I think the notion of a single scheduling window is just the wrong approach in SLP. In this case, I see a couple of alternatives:
I don't see any easy fixes here. I'm going to give this some more thought, but we might be stuck here unless we're willing to do a wholesale rewrite of SLP.
Currently, every not-previously-emitted reference to a Julia function gets a unique new name when we generate LLVM IR, and we resolve all those names later, when we actually emit the referenced function. This causes confusion in LLVM IR output (e.g. in #44998, where we had tens of thousands of unique names for the exact same function). It doesn't much matter for the JIT, since the references get merged before the JIT runs, but for IR output this change will make the result much nicer.
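A hedged sketch of the naming behavior this fix targets (this is illustrative Python, not Julia's actual codegen): the old scheme mints a fresh placeholder symbol per reference, so N references to one function yield N names to resolve later, while the proposed scheme reuses one stable name per function.

```python
# Illustrative model of the two symbol-naming schemes; class and
# method names here are invented for the example.
import itertools

class UniquingEmitter:
    """Old behavior: a brand-new symbol for every reference."""
    def __init__(self):
        self._counter = itertools.count()
        self.pending = {}  # placeholder name -> referenced function

    def reference(self, func_name):
        placeholder = f"{func_name}.{next(self._counter)}"
        self.pending[placeholder] = func_name
        return placeholder

class StableEmitter:
    """Proposed behavior: one stable name per referenced function."""
    def __init__(self):
        self.pending = {}

    def reference(self, func_name):
        self.pending[func_name] = func_name
        return func_name

u, s = UniquingEmitter(), StableEmitter()
print([u.reference("rand") for _ in range(3)])  # ['rand.0', 'rand.1', 'rand.2']
print([s.reference("rand") for _ in range(3)])  # ['rand', 'rand', 'rand']
```

With uniquing, the pending map (and the emitted IR) grows with the number of references; with stable names it grows only with the number of distinct functions.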
The following function takes about 1.5 seconds to infer and more than 15 minutes to LLVM optimize:
It is a very large function, but this amount of LLVM time is excessive. Most of it is spent in various memory queries. I think we can significantly speed up LLVM time by being slightly smarter in how we emit this code.