Excessive LLVM time in large function #44998

Open · Keno opened this issue Apr 15, 2022 · 4 comments
Labels: compiler:codegen (Generation of LLVM IR and native code), compiler:llvm (For issues that relate to LLVM), performance (Must go faster)

Comments

Keno (Member) commented Apr 15, 2022

The following function takes about 1.5 seconds to infer and more than 15 minutes to LLVM optimize:

const NENTRIES = 300
const NTUPLES = 180
struct WrappedTuple
   x::Tuple
end

struct DefaultOr{T}
   x::T
   default::Bool
end

@noinline noinline_rand() = rand()
@eval function torture()
    $(Expr(:tuple, (Expr(:call, WrappedTuple, Expr(:tuple, (:(DefaultOr(noinline_rand(), true)) for i = 1:NENTRIES)...)) for i = 1:NTUPLES)...))
end
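
(For readers unfamiliar with the Expr construction above: with tiny constants, say NENTRIES = NTUPLES = 2, chosen here purely for illustration, the generated body is just one literal nested tuple, roughly:)

function torture()
    (WrappedTuple((DefaultOr(noinline_rand(), true), DefaultOr(noinline_rand(), true))),
     WrappedTuple((DefaultOr(noinline_rand(), true), DefaultOr(noinline_rand(), true))))
end

With the real constants there are 180 WrappedTuple entries of 300 DefaultOr calls each, i.e. 54,000 noinline_rand calls in a single function body.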

It is a very large function, but this amount of LLVM time is excessive. Most of it is spent in various memory queries. I think we can significantly speed up LLVM time by being slightly smarter in how we emit this code.
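
(Not from the issue, but a rough way to time the two stages separately from the REPL; code_llvm with the default optimize=true runs roughly the LLVM pipeline the JIT would, at the session's optimization level:)

using InteractiveUtils

@time code_typed(torture, Tuple{});           # roughly the inference cost on the first call
@time code_llvm(devnull, torture, Tuple{});   # includes running the LLVM optimization passes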

@oscardssmith added the performance (Must go faster), compiler:codegen (Generation of LLVM IR and native code), and compiler:llvm (For issues that relate to LLVM) labels on Apr 15, 2022
preames commented Apr 22, 2022

The following comment applies to a timing.ll file that Keno generated from the above Julia code with some custom changes. He has explicitly stated that the timing.ll file is to be considered public. I tried to attach it here, but I'm getting errors on the attempt.

$ time ./opt -O2 ~/client-data/julia/2022-04-22/timing.ll -S -o ~/client-data/julia/2022-04-22/timing.O2-out.ll
real	1m49.416s
user	1m48.283s
sys	0m0.588s

-time-passes overwhelmingly shows the problem is in SLP.

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  98.2131 ( 86.3%)   0.4037 ( 65.3%)  98.6168 ( 86.2%)  99.1410 ( 86.2%)  SLPVectorizerPass
   4.0994 (  3.6%)   0.0000 (  0.0%)   4.0995 (  3.6%)   4.1038 (  3.6%)  InstCombinePass
   3.5153 (  3.1%)   0.0479 (  7.8%)   3.5632 (  3.1%)   3.5679 (  3.1%)  ModuleInlinerWrapperPass
   3.2883 (  2.9%)   0.0479 (  7.8%)   3.3363 (  2.9%)   3.3405 (  2.9%)  DevirtSCCRepeatedPass

I'm in the process of running perf now, but in the meantime, here are some observations on the IR itself.

The IR is a repeating pattern that looks like this:

  ;; copy one
  %15 = tail call double @j_noinline_rand_78() #4, !dbg !19
  %16 = getelementptr inbounds i8, ptr %4, i64 64, !dbg !20
  store double %15, ptr %16, align 8, !dbg !20, !tbaa !25
  %17 = getelementptr inbounds i8, ptr %4, i64 72, !dbg !20
  store i8 1, ptr %17, align 1, !dbg !20, !tbaa !25

  ;; copy two
  %18 = tail call double @j_noinline_rand_79() #4, !dbg !19
  %19 = getelementptr inbounds i8, ptr %4, i64 80, !dbg !20
  store double %18, ptr %19, align 8, !dbg !20, !tbaa !25
  %20 = getelementptr inbounds i8, ptr %4, i64 88, !dbg !20
  store i8 1, ptr %20, align 1, !dbg !20, !tbaa !25

This is the optimized IR, but the initial IR is pretty similar, just with less canonicalized addressing.

This pattern repeats several hundred (thousand?) times.

This looks like an obvious candidate for loop rerolling. This pattern writes {8-byte random value, 1-byte 1, 7 bytes undef} to every 16-byte stride of a newly allocated array. There are 180 new arrays, each initialized this way. This looks like something we should be able to turn into either a loop nest, or at least a sequence of 180 initialization loops.

However, the calls to rand have been generated as calls to unique symbols, preventing that reroll.

We could in theory teach the reroller to reroll via indirect function calls through an array of function pointer constants, but it would probably be better to explore why we need unique symbols to start with.
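
(As an aside, not from the thread: at the source level the rerolled form per object amounts to nothing more than a plain initialization loop. The sketch below is hypothetical; buf and init_one! are illustrative names, not part of the issue.)

# One loop per object instead of NENTRIES unrolled call/store groups.
function init_one!(buf::Vector{DefaultOr{Float64}})
    for i in eachindex(buf)
        buf[i] = DefaultOr(noinline_rand(), true)   # 8-byte value + 1-byte flag per 16-byte slot
    end
    return buf
end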

Despite this pattern, SLP still shouldn't be taking so much time to do nothing. I'll update once perf report gets around to completing. :)

preames commented Apr 22, 2022

Ok, here are some half-baked thoughts on what's going on in the SLP code.

We appear to start by trying to vectorize the stores of the newly allocated objects into the result buffer. (Not the initialization of the objects themselves.) This is fine, except that we then try to extend the scheduling window all the way back to the allocations. This includes every instruction in the basic block. We then spend all of our time trying to calculate dependencies for every instruction in the block.

Moreover, we repeat this same rebuild of dependencies for each pair of stores to the result buffer. I'm currently a bit unclear as to why we need to reset the dependencies - as opposed to just the schedule - when moving between pairs of stores. We clearly do - I can see it happen - but I'm not sure why.
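
(A purely illustrative back-of-the-envelope, not a measurement: the constants come from the test case, the per-entry instruction count from the IR snippet above, and the pair count is a guess.)

entries          = 180 * 300        # NTUPLES * NENTRIES objects initialized in one block
instrs_per_entry = 5                # call + 2 getelementptrs + 2 stores, per the IR snippet
block_length     = entries * instrs_per_entry   # ≈ 270_000 instructions in a single block

# Anything that re-walks the whole block once per candidate store pair pays roughly
# block_length work per pair, before counting the alias queries done per instruction.
pairs = 100                         # hypothetical count of result-buffer store pairs
pairs * block_length                # ≈ 2.7e7 instructions visited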

The worst part is that the vectorization ends up being unprofitable, so we don't get any benefit at all from this. :)

Keno (Member, Author) commented Apr 22, 2022

> This looks like an obvious candidate for loop rerolling. This pattern is writing the result of rand to every 16 byte stride in an array, and writing the value 1 in the intermediate bytes. However, the calls to rand have been generated as calls to unique symbols, preventing that reroll.

That pattern is only present in the synthetic test case. For the real case, each of the calls is different (and most of them will be inlined, giving non-regular patterns). That said, I do not know why each of the symbols got uniqued.

preames commented Apr 22, 2022

Ok, here's a slightly more complete sketch of what's going on in the SLP vectorizer.

for each store pair:
  throw away scheduling region
  recreate scheduling region for pair of stores
  check schedulability of stores - OK (this is cheap)
  recurse on operands to stores (e.g. the newly created objects)
  extend scheduling region to object creation - this forces dependency calculation for *all* instructions between first object creation and store
  recognize that pair of new objects are not schedulable
  cost vectorization of stores on their own - result unprofitable, discard and continue

The more I look at this, the more I think the notion of a single scheduling window is just the wrong approach in SLP. In this case, I see a couple of alternatives:

  • If we had two independent scheduling windows, we could prove inability to schedule the allocation bundle without scanning all the instructions between the bundles.
  • If we could keep dependency data across pairs (via some "max scheduling window"), we could avoid the dependency recompute cost. This introduces a new failure mode though when the actual scheduling window we're interested in is small, but it overlaps with some previous large window. Maybe we should just give up and compute dependence for the whole block at once? Or use some exponential expansion scheme to avoid degenerate reschedule on extend down cases?

I don't see any easy fixes here. I'm going to give this some more thought, but we might be stuck here unless we're willing to do a wholesale rewrite of SLP.

Keno added a commit that referenced this issue Apr 25, 2022
Currently, every not-previously-emitted reference to a Julia function
gets a unique new name when we generate LLVM IR, and we resolve all
those names later when we actually emit the referenced function.
This causes confusion in LLVM IR output (e.g. in #44998, where
we had tens of thousands of unique names for the exact same
function). It doesn't matter so much for the JIT, since the
references get merged before the JIT runs, but for output to
IR, this change will make the result much nicer.
Keno added a commit that referenced this issue Apr 26, 2022 (same commit message as above).