Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Inference] limit inference timing recording to NativeInterpreter only #49391

Merged
merged 4 commits into from
Apr 25, 2023

Conversation

vchuravy
Copy link
Member

@vchuravy vchuravy commented Apr 17, 2023

Allows consumers to separate native inference from inference coming
from a custom AbstractInterpreter.

x-ref: timholy/SnoopCompile.jl#338

I choose typeof(interp) since the interp may be arbitrarily heavy and we might not want to hold on to them.

@vchuravy vchuravy added compiler:inference Type inference backport 1.9 Change should be backported to release-1.9 caching labels Apr 17, 2023
Copy link
Member

@maleadt maleadt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

@aviatesk
Copy link
Member

Alternatively we can do something like:

diff --git a/base/compiler/typeinfer.jl b/base/compiler/typeinfer.jl
index 1eec73d043..eb88a2081e 100644
--- a/base/compiler/typeinfer.jl
+++ b/base/compiler/typeinfer.jl
@@ -205,8 +205,7 @@ __set_measure_typeinf(onoff::Bool) = __measure_typeinf__[] = onoff
 const __measure_typeinf__ = fill(false)
 
 # Wrapper around _typeinf that optionally records the exclusive time for each invocation.
-function typeinf(interp::AbstractInterpreter, frame::InferenceState)
-    interp = switch_from_irinterp(interp)
+function typeinf(interp::NativeInterpreter, frame::InferenceState)
     if __measure_typeinf__[]
         Timings.enter_new_timer(frame)
         v = _typeinf(interp, frame)
@@ -216,6 +215,8 @@ function typeinf(interp::AbstractInterpreter, frame::InferenceState)
         return _typeinf(interp, frame)
     end
 end
+# disable recording timings for external `AbstractInterpreter`s
+typeinf(interp::AbstractInterpreter, frame::InferenceState) = _typeinf(interp, frame)
 
 function finish!(interp::AbstractInterpreter, caller::InferenceResult)
     # If we didn't transform the src for caching, we may have to transform
@@ -242,6 +243,7 @@ function finish!(interp::AbstractInterpreter, caller::InferenceResult)
 end
 
 function _typeinf(interp::AbstractInterpreter, frame::InferenceState)
+    interp = switch_from_irinterp(interp)
     typeinf_nocycle(interp, frame) || return false # frame is now part of a higher cycle
     # with no active ip's, frame is done
     frames = frame.callers_in_cycle

?

@vchuravy
Copy link
Member Author

Alternatively we can do something like:

I thought about that, but I actually want to be able to see how much time it took to infer CUDA code.

@vchuravy vchuravy added the gpu Affects running Julia on a GPU label Apr 18, 2023
@aviatesk
Copy link
Member

I thought about that, but I actually want to be able to see how much time it took to infer CUDA code.

GPUCompiler can implement its own measurement mechanism for that?

I'm not opposed to the current approach, but it seems somewhat risky to leave Core.Compiler.Timings to support an external AbstractInterpreters. The reason being that the external AbstractInterpreter's compilation pipeline may trigger native compilation, which could invalidate the current assumption of being able to find children/parent frames. So I believe it would be cleaner and more correct to have Core.Compiler.Timings only support NativeInterpreter by default, and have any external AbstractInterpreter optionally use the data structure of Core.Compiler.Timings to inspect its own compilation performance.

@vchuravy
Copy link
Member Author

The major reason would be that it would be great if SnoopCompile could be extended to custom abstract interpreters.
For me that's the next frontier in TTFX. Oceananigans has 10+ minutes of compilation time.

@vchuravy
Copy link
Member Author

The reason being that the external AbstractInterpreter's compilation pipeline may trigger native compilation, which could invalidate the current assumption of being able to find children/parent frames.

I don't know why that would be the case? I would say it is correct? You might have arbitrary interleavings?

@aviatesk
Copy link
Member

aviatesk commented Apr 18, 2023

I don't know why that would be the case? I would say it is correct? You might have arbitrary interleavings?

It's possible that there are type unstable parts within some external AbstractInterpreter implementation, and then invoking the external compilation pipeline may trigger dynamic dispatch to native-compile the compilation pipeline itself.

@aviatesk
Copy link
Member

The major reason would be that it would be great if SnoopCompile could be extended to custom abstract interpreters.
For me that's the next frontier in TTFX. Oceananigans has 10+ minutes of compilation time.

Yeah, I agree. But would it be cleaner if external AbstractInterpreter uses SnoopCompile utilities like this?

const gpu_timings = CC.Timings.Timing[]

function CC.typeinf(interp::GPUInterpreter, frame::InferenceState)
    [ push timing into gpu_timings ]
    @invoke CC.typeinf(interp::AbstractInterpreter, frame::InferenceState)
end

macro snoopi_deep_gpu(ex) _snoopi_deep_gpu(ex) end

function _snoopi_deep_gpu(cmd::Expr)
    return quote
        start_gpu_deep_timing()
        try
            $(esc(cmd))
        finally
            stop_gpu_deep_timing()
        end
        CC.Timings.InferenceTimingNode(gpu_timings[1])
    end 
end

@vchuravy
Copy link
Member Author

vchuravy commented Apr 18, 2023 via email

@aviatesk
Copy link
Member

Leaving a comment from slack for future reference:

Shuhei Kadowaki
I think it is problematic that NativeInterpreter and external AbstractInterpreters share the same data structure (Core.Compiler.Timings._timings), since Core.Compiler.Timings seems to assume that the inference graph is constructed by the same interpreter.

E.g.

# Prepare to unwind one level of the stack and record in the parent
parent_timer = _timings[end]
finds the parent frame just by looking at _timings[end], and currently we may get an inference graph like:

Timing(::GPUInterpreter, MethodInstance for xxx, ...)
 Timing(::GPUInterpreter, MethodInstance for yyy, ...)
  Timing(::NativeInterpreter, MethodInstance for GPUCompiler.abstract_eval_zzz, ...)
   Timing(::NativeInterpreter, MethodInstance for GPUCompiler.some_utility_for_abstract_eval_zzz, ...)
  Timing(::GPUInterpreter, MethodInstance for zzz, ...)

, which seems to be a bit tricky to handle?

If we isolate Core.Compiler.Timings._timings to NativeInterpreter and make external AbstractInterpreter maintain its own timings, we will get cleaner graphs like

Timing(::GPUInterpreter, MethodInstance for xxx, ...)
 Timing(::GPUInterpreter, MethodInstance for yyy, ...)
  Timing(::GPUInterpreter, MethodInstance for zzz, ...)

and

Timing(::NativeInterpreter, MethodInstance for GPUCompiler.abstract_eval_zzz, ...)
 Timing(::NativeInterpreter, MethodInstance for GPUCompiler.some_utility_for_abstract_eval_zzz, ...)

vchuravy and others added 4 commits April 25, 2023 15:41
Allows consumers to separate native inference from inference coming
from a custom AbstractInterpreter.

Co-authored-by: Shuhei Kadowaki <40514306+aviatesk@users.noreply.github.com>
@aviatesk aviatesk changed the title [Inference] Mark timing with typeof(interp) [Inference] enable inference timing recording only for NativeInterpreter Apr 25, 2023
@aviatesk aviatesk changed the title [Inference] enable inference timing recording only for NativeInterpreter [Inference] limit inference timing recording to NativeInterpreter only Apr 25, 2023
@aviatesk aviatesk merged commit 3db036e into master Apr 25, 2023
@aviatesk aviatesk deleted the vc/inf_timings branch April 25, 2023 11:17
vchuravy added a commit that referenced this pull request Apr 25, 2023
…nly (#49391)

The logic of `Core.Compiler.Timings` assumes that the whole recorded
inference graph is constructed by the same interpreter, thus we should
limit the inference timing recording to `NativeInterpreter` only.

External `AbstractInterpreter` can implement its own recording logic,
likely by reusing existing `Core.Compiler.Timings` utilities, in a way
that does not interfere with the recording for native compilation pipeline.

---------

Co-authored-by: Shuhei Kadowaki <aviatesk@gmail.com>
(cherry picked from commit 3db036e)
@KristofferC KristofferC mentioned this pull request May 8, 2023
51 tasks
@KristofferC KristofferC removed the backport 1.9 Change should be backported to release-1.9 label May 28, 2023
vchuravy added a commit that referenced this pull request Apr 15, 2024
…reters (#54069)

Partially reverts #49391

PrecompileTools uses the timing infrastructure to snoop on the inference
process.
The reason for #49391 was that this could lead to accidental pollution
of the caches
with foreign results
(timholy/SnoopCompile.jl#338)

After #52233 and especially #53336 we now filter results by cache owner
and
don't try to cache foreign code using the native pipeline.

Motivated by JuliaGPU/GPUCompiler.jl#567 which
demonstrated
that a foreign code instance would not be cached without
PrecompileTools.
KristofferC pushed a commit that referenced this pull request Apr 17, 2024
…reters (#54069)

Partially reverts #49391

PrecompileTools uses the timing infrastructure to snoop on the inference
process.
The reason for #49391 was that this could lead to accidental pollution
of the caches
with foreign results
(timholy/SnoopCompile.jl#338)

After #52233 and especially #53336 we now filter results by cache owner
and
don't try to cache foreign code using the native pipeline.

Motivated by JuliaGPU/GPUCompiler.jl#567 which
demonstrated
that a foreign code instance would not be cached without
PrecompileTools.

(cherry picked from commit c0611e8)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
caching compiler:inference Type inference gpu Affects running Julia on a GPU
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants