Precompiling and FunctionWrappers lead to extra runtime allocations #54832

Open
amilsted opened this issue Jun 17, 2024 · 8 comments

@amilsted
Contributor

amilsted commented Jun 17, 2024

Precompiling a function that constructs and calls FunctionWrappers seems to make each call to the FunctionWrapper allocate.

Here's my test package:

module FWAllocs

using PrecompileTools
using FunctionWrappers

function test_precomp(N)
    fw = FunctionWrappers.FunctionWrapper{Float64, Tuple{Float64}}(cos)
    res = 0.0
    for i in 1:N
        res += fw(i*0.1)
    end
    return res
end

function test_no_precomp(N)
    fw = FunctionWrappers.FunctionWrapper{Float64, Tuple{Float64}}(cos)
    res = 0.0
    for i in 1:N
        res += fw(i*0.1)
    end
    return res
end

@setup_workload begin
    @compile_workload begin
        test_precomp(10)
    end
end

end

and here's what happens on Julia 1.10.4:

julia> using FWAllocs

julia> @time FWAllocs.test_precomp(10)
  0.053401 seconds (96.14 k allocations: 5.794 MiB, 99.86% compilation time)
8.177847573818267

julia> @time FWAllocs.test_precomp(10)
  0.000004 seconds (22 allocations: 376 bytes)
8.177847573818267

julia> @time FWAllocs.test_precomp(20)
  0.000011 seconds (42 allocations: 696 bytes)
8.377322108212503

julia> @time FWAllocs.test_no_precomp(20)
  0.000007 seconds (2 allocations: 56 bytes)
8.377322108212503

julia> 

Also, is it odd that test_precomp needs to compile on first use, but test_no_precomp does not?
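As a rough reading of the `@time` figures above, the allocation count of `test_precomp` appears to grow linearly with `N`. The short sketch below only does arithmetic on the numbers copied from the transcript (nothing here is measured anew) and separates a per-call cost from a fixed part:

```julia
# Figures copied from the @time output above (Julia 1.10.4).
precomp = Dict(10 => (allocs = 22, bytes = 376),
               20 => (allocs = 42, bytes = 696))
no_precomp_20 = (allocs = 2, bytes = 56)

# Per-iteration cost of the precompiled variant: slope between N = 10 and N = 20.
allocs_per_call = (precomp[20].allocs - precomp[10].allocs) ÷ (20 - 10)
bytes_per_call  = (precomp[20].bytes  - precomp[10].bytes)  ÷ (20 - 10)

# Fixed part left over after removing the per-iteration cost.
fixed_allocs = precomp[10].allocs - 10 * allocs_per_call
fixed_bytes  = precomp[10].bytes  - 10 * bytes_per_call

println("per call: ", allocs_per_call, " allocations, ", bytes_per_call, " bytes")
println("fixed:    ", fixed_allocs, " allocations, ", fixed_bytes, " bytes")
```

Under this reading, each call through the wrapper in `test_precomp` costs 2 allocations (32 bytes), and the remaining fixed part (2 allocations, 56 bytes) equals what `test_no_precomp(20)` reports in total.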

Originally posted by @amilsted in #35972 (comment)

Confirmed on nightly by @kimikage: #35972 (comment)

@kimikage
Contributor

Just FYI, this problem can also be reproduced by calling precompile(test_precomp, (Int,)) directly, without PrecompileTools.

Also, the for i in 1:N loop is only needed to show that the allocation occurs on every call; it is not needed to reproduce the problem itself.
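For reference, the direct form mentioned here might look like the sketch below. The function `test_precomp_standin` is a hypothetical stand-in that calls `cos` directly so the snippet runs without FunctionWrappers installed; in the actual reproducer the directive would name `test_precomp` and sit at the top level of the `FWAllocs` module in place of the `@setup_workload` block. Run outside a package like this, it only shows the call shape and does not reproduce the allocations.

```julia
# Hypothetical stand-in for test_precomp from the original post: same loop
# shape, but it calls cos directly instead of going through a FunctionWrapper.
function test_precomp_standin(N)
    res = 0.0
    for i in 1:N
        res += cos(i * 0.1)
    end
    return res
end

# Direct precompile directive: the function plus a tuple of argument types.
# It returns a Bool indicating whether precompilation succeeded.
ok = precompile(test_precomp_standin, (Int,))
println("precompile succeeded: ", ok)
```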

@kimikage
Contributor

kimikage commented Jun 17, 2024

We now have PkgCacheInspector.jl. However, I could not find the cause.

Edit:
SnoopCompile.@snoopi_deep also helps to understand what is happening.

@kimikage
Contributor

kimikage commented Jun 17, 2024

Despite the use of function pointers, this problem does not occur in the following simple case.

function test_precomp()
    ptr = @cfunction(cos, Float64, (Float64,))
    ccall(ptr, Float64, (Float64,), 0.0)
end

Edit:
So, the generated function do_call() might be relevant.
https://github.com/yuyichao/FunctionWrappers.jl/blob/ad1cea6fd36a7e72c2755efacd4f28b52fbb1f6a/src/FunctionWrappers.jl#L125-L142

Edit2:
FunctionWrappers#master includes a change (PR yuyichao/FunctionWrappers.jl#31), but this problem still occurs with it.
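One way to check the plain `@cfunction` case from this comment for allocations is sketched below. The name `call_through_cfunction` is introduced here only for illustration. Note the assumption: run at the REPL this measures runtime-compiled code, so to compare against the issue the function would have to live in a package and be precompiled as in the original post.

```julia
# The simple case from this comment, renamed and given an argument so it can
# be measured; the name is hypothetical.
function call_through_cfunction(x::Float64)
    ptr = @cfunction(cos, Float64, (Float64,))
    return ccall(ptr, Float64, (Float64,), x)
end

call_through_cfunction(0.0)   # warm up so compilation is not measured
bytes = @allocated call_through_cfunction(0.0)
println("bytes allocated on a warm call: ", bytes)
```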

@amilsted
Contributor Author

amilsted commented Jun 17, 2024

In case it's a clue: I see allocations like this from FunctionWrappers used in precompiled code in a large codebase (the allocations go away if I don't precompile), but only on Apple Silicon (unlike the minimal case above). For some reason, the problem goes away on x86. Perhaps related to yuyichao/FunctionWrappers.jl#30?

@NHDaly
Member

NHDaly commented Sep 15, 2024

This looks potentially related, though I'm not sure it's entirely the same issue. If not, we can extract it into a separate ticket, but it sounds very similar to what you're describing.

Here is an MRE in which assigning the result of a @cfunction function pointer to a global constant somehow seems to intrusively transform it into a variant that allocates:

module TestFP

mutable struct Record
  x::Int
end

Base.@ccallable function update_record!(r::Record)::Int
    r.x += 1
end
make_fp1() = @cfunction(update_record!, Int, (Any,))
make_fp2() = @cfunction(update_record!, Int, (Any,))

# Call both functions, so that the difference isn't in which ones are _called_:
@time make_fp1()
@time make_fp2()

mutable struct PointerHolder
    ptr::Ptr{Cvoid}
end

# Somehow, assigning the @cfunction constructor at package precompile time to a const value
# has a destructive side-effect on the performance of calls through the pointer returned by
# make_fp2(). Calling through the result of make_fp2() causes dynamic dispatch, while
# make_fp1() does not.
const FP2 = PointerHolder(make_fp2())  # (Even though this will be 0'd out during serialization)

end # module TestFP

julia> using TestFP, BenchmarkTools
Info Given TestFP was explicitly requested, output will be shown live 
  0.000000 seconds
  0.000000 seconds
Precompiling TestFP finished.
  1 dependency successfully precompiled in 1 seconds
  1 dependency had output during precompilation:
┌ TestFP
│  [Output was shown above]
└  
julia> @btime ccall($(TestFP.make_fp1()), Int, (Any,), $(TestFP.Record(0)))
  4.708 ns (0 allocations: 0 bytes)
500502

julia> @btime ccall($(TestFP.make_fp2()), Int, (Any,), $(TestFP.Record(0)))
  15.348 ns (1 allocation: 16 bytes)
500502

@timholy
Member

timholy commented Sep 16, 2024

Might be worth checking for [unknown stackframe] in profile trees as in #50749. That would be a sign of something not getting cached.
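A minimal sketch of such a check, using the `Profile` standard library, could look like this. The `workload` function is a hypothetical stand-in; in practice it would be replaced by the call under investigation, such as the precompiled `test_precomp` from the original post with a large `N`.

```julia
using Profile

# Hypothetical stand-in workload; replace with the call under investigation.
function workload(n)
    s = 0.0
    for i in 1:n
        s += cos(i * 0.1)
    end
    return s
end

workload(10)            # warm up so compilation is not profiled
Profile.clear()
@profile workload(10^7)

# Render the profile tree to a buffer and search it instead of reading by eye.
buf = IOBuffer()
Profile.print(buf)
has_unknown = occursin("unknown stackframe", String(take!(buf)))
println("unknown stackframe present: ", has_unknown)
```

Whether the marker shows up depends on the code being profiled; the stand-in above is not expected to tell you anything about the issue itself.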

@NHDaly
Member

NHDaly commented Sep 16, 2024

Oh, yes, I do see that! 👍 What do I take away from that?

@timholy
Member

timholy commented Sep 17, 2024

Discovering any sort of pattern about what is getting omitted might help. My first thought would be to try some of the debugging that starts here: #35972 (comment)

NHDaly added a commit to NHDaly/DispatchExperiments.jl that referenced this issue Oct 15, 2024