Excessive LLVM time in large function #44998

Open · Keno opened this issue Apr 15, 2022 · 4 comments
Labels: compiler:codegen (Generation of LLVM IR and native code), compiler:llvm (For issues that relate to LLVM), performance (Must go faster)

Comments

Keno (Member) commented Apr 15, 2022

The following function takes about 1.5 seconds to infer and more than 15 minutes to LLVM optimize:

const NENTRIES = 300
const NTUPLES = 180
struct WrappedTuple
   x::Tuple
end

struct DefaultOr{T}
   x::T
   default::Bool
end

@noinline noinline_rand() = rand()
@eval function torture()
    $(Expr(:tuple, (Expr(:call, WrappedTuple, Expr(:tuple, (:(DefaultOr(noinline_rand(), true)) for i = 1:NENTRIES)...)) for i = 1:NTUPLES)...))
end
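
(For readers unfamiliar with the Expr construction above: with tiny constants, say NENTRIES = NTUPLES = 2, chosen here purely for illustration, the generated body is just one literal nested tuple, roughly:)

function torture()
    (WrappedTuple((DefaultOr(noinline_rand(), true), DefaultOr(noinline_rand(), true))),
     WrappedTuple((DefaultOr(noinline_rand(), true), DefaultOr(noinline_rand(), true))))
end

With the real constants there are 180 WrappedTuple entries of 300 DefaultOr calls each, i.e. 54,000 noinline_rand calls in a single function body.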

It is a very large function, but this amount of LLVM time is excessive. Most of it is spent in various memory queries. I think we can significantly speed up LLVM time by being slightly smarter in how we emit this code.
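
(Not from the issue, but a rough way to time the two stages separately from the REPL; code_llvm with the default optimize=true runs roughly the LLVM pipeline the JIT would, at the session's optimization level:)

using InteractiveUtils

@time code_typed(torture, Tuple{});           # roughly the inference cost on the first call
@time code_llvm(devnull, torture, Tuple{});   # includes running the LLVM optimization passes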

@oscardssmith added the performance (Must go faster), compiler:codegen (Generation of LLVM IR and native code), and compiler:llvm (For issues that relate to LLVM) labels on Apr 15, 2022
preames commented Apr 22, 2022

The following comment applies to a timing.ll file that Keno generated from the above Julia code with some custom changes. He has explicitly stated that the timing.ll file is to be considered public. I tried to attach it here, but I'm getting errors on the attempt.

$ time ./opt -O2 ~/client-data/julia/2022-04-22/timing.ll -S -o ~/client-data/julia/2022-04-22/timing.O2-out.ll
real	1m49.416s
user	1m48.283s
sys	0m0.588s

-time-passes overwhelmingly shows the problem is in SLP.

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  98.2131 ( 86.3%)   0.4037 ( 65.3%)  98.6168 ( 86.2%)  99.1410 ( 86.2%)  SLPVectorizerPass
   4.0994 (  3.6%)   0.0000 (  0.0%)   4.0995 (  3.6%)   4.1038 (  3.6%)  InstCombinePass
   3.5153 (  3.1%)   0.0479 (  7.8%)   3.5632 (  3.1%)   3.5679 (  3.1%)  ModuleInlinerWrapperPass
   3.2883 (  2.9%)   0.0479 (  7.8%)   3.3363 (  2.9%)   3.3405 (  2.9%)  DevirtSCCRepeatedPass

I'm in the process of running perf now, but in the meantime, here are some observations on the IR itself.

The IR is a repeating pattern that looks like this:

  ;; copy one
  %15 = tail call double @j_noinline_rand_78() #4, !dbg !19
  %16 = getelementptr inbounds i8, ptr %4, i64 64, !dbg !20
  store double %15, ptr %16, align 8, !dbg !20, !tbaa !25
  %17 = getelementptr inbounds i8, ptr %4, i64 72, !dbg !20
  store i8 1, ptr %17, align 1, !dbg !20, !tbaa !25

  ;; copy two
  %18 = tail call double @j_noinline_rand_79() #4, !dbg !19
  %19 = getelementptr inbounds i8, ptr %4, i64 80, !dbg !20
  store double %18, ptr %19, align 8, !dbg !20, !tbaa !25
  %20 = getelementptr inbounds i8, ptr %4, i64 88, !dbg !20
  store i8 1, ptr %20, align 1, !dbg !20, !tbaa !25

This is the optimized IR, but the initial IR is pretty similar, just with less canonicalized addressing.

This pattern repeats several hundred (thousand?) times.

This looks like an obvious candidate for loop rerolling. This pattern writes {8-byte random value, 1-byte 1, 7 bytes undef} to every 16-byte stride of a newly allocated array. There are 180 new arrays, each initialized this way. This looks like something we should be able to turn into either a loop nest, or at least a sequence of 180 initialization loops.

However, the calls to rand have been generated as calls to unique symbols, preventing that reroll.

We could in theory teach the reroller to reroll via indirect function calls through an array of function pointer constants, but it would probably be better to explore why we need unique symbols to start with.
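
(As an aside, not from the thread: at the source level the rerolled form per object amounts to nothing more than a plain initialization loop. The sketch below is hypothetical; buf and init_one! are illustrative names, not part of the issue.)

# One loop per object instead of NENTRIES unrolled call/store groups.
function init_one!(buf::Vector{DefaultOr{Float64}})
    for i in eachindex(buf)
        buf[i] = DefaultOr(noinline_rand(), true)   # 8-byte value + 1-byte flag per 16-byte slot
    end
    return buf
end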

Despite this pattern, SLP still shouldn't be taking so much time to do nothing. I'll update once perf report gets around to completing. :)

preames commented Apr 22, 2022

Ok, here are some half-baked thoughts on what's going on in the SLP code.

We appear to start by trying to vectorize the stores of the newly allocated objects into the result buffer. (Not the initialization of the objects themselves.) This is fine, except that we then try to extend the scheduling window all the way back to the allocations. This includes every instruction in the basic block. We then spend all of our time trying to calculate dependencies for every instruction in the block.

Moreover, we repeat this same rebuild of dependencies for each pair of stores to the result buffer. I'm currently a bit unclear as to why we need to reset the dependencies - as opposed to just the schedule - when moving between pairs of stores. We clearly do - I can see it happen - but I'm not sure why.
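
(A purely illustrative back-of-the-envelope, not a measurement: the constants come from the test case, the per-entry instruction count from the IR snippet above, and the pair count is a guess.)

entries          = 180 * 300        # NTUPLES * NENTRIES objects initialized in one block
instrs_per_entry = 5                # call + 2 getelementptrs + 2 stores, per the IR snippet
block_length     = entries * instrs_per_entry   # ≈ 270_000 instructions in a single block

# Anything that re-walks the whole block once per candidate store pair pays roughly
# block_length work per pair, before counting the alias queries done per instruction.
pairs = 100                         # hypothetical count of result-buffer store pairs
pairs * block_length                # ≈ 2.7e7 instructions visited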

The worst part is that the vectorization ends up being unprofitable, so we don't get any benefit at all from this. :)

Keno (Member, Author) commented Apr 22, 2022

> This looks like an obvious candidate for loop rerolling. This pattern is writing the result of rand to every 16 byte stride in an array, and writing the value 1 in the intermediate bytes. However, the calls to rand have been generated as calls to unique symbols, preventing that reroll.

That pattern is only present in the synthetic test case. For the real case, each of the calls is different (and most of them will be inlined, giving non-regular patterns). That said, I do not know why each of the symbols got uniqued.

preames commented Apr 22, 2022

Ok, here's a slightly more complete sketch of what's going on in the SLP vectorizer.

for each store pair:
  throw away scheduling region
  recreate scheduling region for pair of stores
  check schedulability of stores - OK (this is cheap)
  recurse on operands to stores (e.g. the newly created objects)
  extend scheduling region to object creation - this forces dependency calculation for *all* instructions between first object creation and store
  recognize that pair of new objects are not schedulable
  cost vectorization of stores on their own - result unprofitable, discard and continue

The more I look at this, the more I think the notion of a single scheduling window is just the wrong approach in SLP. In this case, I see a couple of alternatives:

  • If we had two independent scheduling windows, we could prove inability to schedule the allocation bundle without scanning all the instructions between the bundles.
  • If we could keep dependency data across pairs (via some "max scheduling window"), we could avoid the dependency recompute cost. This introduces a new failure mode though when the actual scheduling window we're interested in is small, but it overlaps with some previous large window. Maybe we should just give up and compute dependence for the whole block at once? Or use some exponential expansion scheme to avoid degenerate reschedule on extend down cases?

I don't see any easy fixes here. I'm going to give this some more thought, but we might be stuck here unless we're willing to do a wholesale rewrite of SLP.

Keno added a commit that referenced this issue Apr 25, 2022
Currently, every not-previously-emitted reference to a Julia function
gets a unique new name when we generate LLVM IR, and we resolve all
those names later when we actually emit the referenced function.
This causes confusion in LLVM IR output (e.g. in #44998, where
we had tens of thousands of unique names for the exact same
function). It doesn't matter so much for the JIT, since the
references get merged before the JIT runs, but for output to
IR, this change will make the result much nicer.
Keno added a commit that referenced this issue Apr 26, 2022 (same commit message as above).