Excessive LLVM time in egal codegen of large struct #54109

Keno · 2024-04-17T00:36:46Z

This is similar to #44998, in that LLVM's SLPVectorizer is involved, but I think it's easier to solve by tweaking the codegen for egal:

struct DefaultOr{T}
   x::T
   default::Bool
end

@eval struct Torture
    $((Expr(:(::), Symbol("x$i"), DefaultOr{Float64}) for i = 1:897)...)
end

egal_any(x::Torture, y::Any) = x === y

julia> @time code_llvm(egal_any, Tuple{Torture, Any})
 22.034327 seconds (5.48 M allocations: 206.847 MiB, 0.40% gc time, 88.69% compilation time: <1% of which was recompilation)

vtjnash · 2024-04-17T00:44:19Z

I think Oscar was proposing making this code more branchy, which should help defeat the vectorizer. Otherwise, all those undef padding bits get in the way of doing simple loops over the bits.
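To make the padding concrete, here is a minimal sketch that inspects the layout of `DefaultOr{Float64}` from the reproducer using only Base reflection (`sizeof`, `fieldoffset`, `fieldtype`, `fieldcount`). The variable names are illustrative, and the exact numbers assume a typical 64-bit build where `Float64` is 8-byte aligned.

```julia
struct DefaultOr{T}
    x::T
    default::Bool
end

T = DefaultOr{Float64}

# Total size of one element, including trailing padding.
total = sizeof(T)

# Byte offset of each field within the struct.
offsets = [Int(fieldoffset(T, i)) for i in 1:fieldcount(T)]

# Bytes that actually carry data (8 for the Float64, 1 for the Bool).
payload = sum(sizeof(fieldtype(T, i)) for i in 1:fieldcount(T))

# Whatever is left over is padding that a bitwise comparison must ignore.
padding = total - payload

println((total = total, offsets = offsets, payload = payload, padding = padding))
```

Under that assumption each element carries 9 data bytes and 7 padding bytes, and `Torture` repeats this pattern 897 times, which is why a plain `memcmp` over the whole struct is not valid here.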

gbaraldi · 2024-04-17T00:49:42Z

Does the padding stop us from emitting a memcpy?

Keno · 2024-04-17T00:58:32Z

Yes, the padding is forcing us to emit this unrolled. I think a reasonable implementation here would be to run-length encode (RLE) the padding bit pattern and then emit the compare as a sequence of loops with an early out between each block. That should let the loop vectorizer emit the correct target-specific comparison sequence for each bit pattern, and give it license to exit the loop early without forcing that semantically.
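The shape of that comparison can be sketched at the Julia level. This is only an illustration of the idea, not the code Julia's codegen emits: the `Run` type and `masked_equal` function are hypothetical names, and the sketch assumes the runs exactly cover both byte buffers.

```julia
# Hypothetical run descriptor: `count` repetitions of a block made of
# `data` compared bytes followed by `pad` ignored (padding) bytes.
struct Run
    data::Int
    pad::Int
    count::Int
end

# Compare two byte buffers, ignoring padding bytes as described by `runs`.
# Assumes sum((r.data + r.pad) * r.count for r in runs) == length(a).
function masked_equal(a::Vector{UInt8}, b::Vector{UInt8}, runs::Vector{Run})
    length(a) == length(b) || return false
    offset = 0
    for r in runs
        acc = true
        # One loop per run: no branch inside, so a vectorizer is free to
        # pick its own comparison sequence and exit strategy.
        for _ in 1:r.count
            for k in 1:r.data
                acc &= a[offset + k] == b[offset + k]
            end
            offset += r.data + r.pad
        end
        # Early out between blocks.
        acc || return false
    end
    return true
end

# Three elements laid out as 9 data bytes + 7 padding bytes each.
runs = [Run(9, 7, 3)]
a = zeros(UInt8, 48)
b = copy(a)
b[10] = 0xff                      # byte 10 is padding, so it is ignored
same_despite_padding = masked_equal(a, b, runs)
b[9] = 0x01                       # byte 9 is data, so the buffers now differ
same_after_data_change = masked_equal(a, b, runs)
```

The inner loops accumulate into `acc` instead of branching, and the only mandatory exit is between runs, which mirrors the "license to early out without forcing it" point above.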

gbaraldi · 2024-04-17T02:18:06Z

We should probably vendor the memcmp expansion code that LLVM has. I'm not sure whether there is anything we can annotate the loop with to say "we don't care if you exit this early or late".

The strategy here is to look at (data, padding) pairs and RLE them into loops, so that repeated adjacent patterns use a loop rather than getting unrolled. On the test case from #54109, this makes compilation essentially instant, while also being faster at runtime (it turns out LLVM spends a massive amount of time and the resulting code is still bad).

There are some obvious further enhancements possible here:

1. The `memcmp` constant is small. LLVM has a pass to inline these with better code. However, we don't have it turned on. We should consider vendoring it, though we may want to add some shortcutting to it to avoid having it iterate through each function.
2. This only does one level of sequence matching. It could be recursed to turn things into nested loops.

However, this solves the immediate issue, so hopefully it's a useful start. Fixes #54109.
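As a rough illustration of the (data, padding) run-length encoding, the sketch below derives the pairs for `DefaultOr{Float64}` and collapses 897 repetitions into a single run. The helpers `data_padding_pairs` and `rle` are hypothetical names written for this example, not functions from Julia's codegen. The sketch only handles structs whose fields are primitive isbits types, and the numbers assume a typical 64-bit build.

```julia
struct DefaultOr{T}
    x::T
    default::Bool
end

# Hypothetical helper: (data, padding) byte pairs for a struct whose
# fields are all primitive isbits types. Adjacent fields with no padding
# between them are merged into one data span.
function data_padding_pairs(::Type{T}) where {T}
    pairs = Tuple{Int,Int}[]
    n = fieldcount(T)
    data = 0
    for i in 1:n
        fsize = sizeof(fieldtype(T, i))
        stop = Int(fieldoffset(T, i)) + fsize
        next = i < n ? Int(fieldoffset(T, i + 1)) : sizeof(T)
        data += fsize
        pad = next - stop
        if pad > 0
            push!(pairs, (data, pad))
            data = 0
        end
    end
    data > 0 && push!(pairs, (data, 0))
    return pairs
end

# Hypothetical helper: single-level run-length encoding of adjacent
# equal items.
function rle(items)
    runs = Tuple{eltype(items),Int}[]
    for item in items
        if !isempty(runs) && runs[end][1] == item
            runs[end] = (item, runs[end][2] + 1)
        else
            push!(runs, (item, 1))
        end
    end
    return runs
end

inner = data_padding_pairs(DefaultOr{Float64})
flat = repeat(inner, 897)   # stand-in for the 897 fields of `Torture`
runs = rle(flat)
println(runs)
```

With one run in hand, the comparison needs a single loop of 897 iterations rather than 897 unrolled copies. This matches the "one level of sequence matching" described above; patterns that only repeat at a coarser granularity would need the recursive variant.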

giordano added the performance, compiler:llvm, and compiler:codegen labels Apr 17, 2024

Keno mentioned this issue Apr 17, 2024

Make emitted egal code more loopy #54121

Merged

Keno closed this as completed in 50833c8 Apr 25, 2024

Keno closed this as completed in #54121 Apr 25, 2024
