Attempts to reduce latency #2225

charleskawczynski · 2025-03-10T18:29:57Z

This PR is an attempt to reduce latency. This was motivated by the example in #2215.

using SnoopCompileCore
using ClimaCore.CommonSpaces
import ClimaCore
import ClimaCore: Fields, Geometry, Operators, Spaces
import LinearAlgebra: dot

space = ExtrudedCubedSphereSpace(; z_elem = 10, z_min = 0, z_max = 1, radius = 10, h_elem = 10, n_quad_points = 4, staggering = CellCenter(), );

ᶜuₕ = similar(zeros(space), Geometry.Covariant12Vector{Float64});
ᶠu₃ = similar(zeros(Spaces.face_space(space)), Geometry.Covariant3Vector{Float64});
ᶠu³ = similar(zeros(Spaces.face_space(space)), Geometry.Contravariant3Vector{Float64});
ᶠuₕ³ = similar(zeros(Spaces.face_space(space)), Geometry.Contravariant3Vector{Float64});
ᶜu = similar(zeros(space), Geometry.Covariant123Vector{Float64});
fill!(parent(ᶜuₕ), 0);
fill!(parent(ᶠu₃), 0);
fill!(parent(ᶠuₕ³), 0);
fill!(parent(ᶜu), 0);
fill!(parent(ᶠu³), 0);
ᶜK = zeros(space);

# Warm InterpolateF2C up
_ = @. Operators.InterpolateF2C()(Geometry.Covariant123Vector(ᶠu₃));


function mytest(ᶜK, ᶜu, ᶠuₕ³, ᶠu³, ᶜuₕ, ᶠu₃)
    interp_f2c = Operators.InterpolateF2C()
    @. ᶜu = Geometry.Covariant123Vector(ᶜuₕ) + interp_f2c(Geometry.Covariant123Vector(ᶠu₃))
    @. ᶠu³ = ᶠuₕ³ + Geometry.Contravariant3Vector(ᶠu₃)
    @. ᶜK = 1 / 2 * (
        dot(Geometry.Covariant123Vector(ᶜuₕ), Geometry.Contravariant123Vector(ᶜuₕ)) +
        interp_f2c(dot(Geometry.Covariant123Vector(ᶠu₃), Geometry.Contravariant123Vector(ᶠu₃))) +
        2 * dot(Geometry.Contravariant123Vector(ᶜuₕ), interp_f2c(Geometry.Covariant123Vector(ᶠu₃)))
    )
    return nothing
end
@time mytest(ᶜK, ᶜu, ᶠuₕ³, ᶠu³, ᶜuₕ, ᶠu₃)

Several experiments were attempted, some were found to be fruitful. These experiments were performed with Julia 1.10:

UnrolledUtilities.unroll_map (naive)
  0.797456 seconds (3.07 M allocations: 220.475 MiB, 23.52% gc time, 94.72% compilation time)
  37.969 ms (0 allocations: 0 bytes)

UnrolledUtilities.unroll_map with Base.@_inline_meta
  3.422859 seconds (46.47 M allocations: 2.352 GiB, 22.98% gc time, 99.61% compilation time)
  1.853 ms (0 allocations: 0 bytes)

UnrolledUtilities.unroll_map with Base.@_propagate_inbounds_meta
  3.190773 seconds (46.47 M allocations: 2.352 GiB, 21.57% gc time, 99.76% compilation time)
  1.877 ms (0 allocations: 0 bytes)

UnrolledUtilities.unrolled_map_into_tuple (naive)
  0.669018 seconds (2.79 M allocations: 199.967 MiB, 14.59% gc time, 93.86% compilation time)
  38.014 ms (0 allocations: 0 bytes)

UnrolledUtilities.unrolled_map_into_tuple with Base.@_propagate_inbounds_meta
  2.343249 seconds (34.25 M allocations: 1.761 GiB, 16.47% gc time, 99.60% compilation time)
  1.900 ms (0 allocations: 0 bytes)

UnrolledUtilities.unrolled_map_into_tuple with Base.@_inline_meta
  2.629591 seconds (34.25 M allocations: 1.761 GiB, 15.49% gc time, 99.73% compilation time)
  1.861 ms (0 allocations: 0 bytes)

@ncall with Base.@_propagate_inbounds_meta
  1.722072 seconds (17.49 M allocations: 981.042 MiB, 19.83% gc time, 98.70% compilation time)
  1.904 ms (0 allocations: 0 bytes)

@ncall with Base.@_propagate_inbounds_meta + type annotations
  1.340580 seconds (17.50 M allocations: 980.648 MiB, 16.88% gc time, 99.72% compilation time)
  1.741 ms (0 allocations: 0 bytes)

@ncall with Base.@_propagate_inbounds_meta + type annotations + mygetidx
  1.715542 seconds (21.53 M allocations: 1.127 GiB, 16.88% gc time, 99.31% compilation time)
  1.744 ms (0 allocations: 0 bytes)

@ncall with Base.@_propagate_inbounds_meta + type annotations + mygetidx + (mygetidx type annotations)
  1.553747 seconds (21.54 M allocations: 1.128 GiB, 14.65% gc time, 99.16% compilation time)
  1.742 ms (0 allocations: 0 bytes)

@ncall with Base.@_propagate_inbounds_meta + type annotations + union splitting in getidx
  1.018555 seconds (11.33 M allocations: 677.195 MiB, 16.38% gc time, 99.62% compilation time)
  1.747 ms (0 allocations: 0 bytes)

@ncall with Base.@_propagate_inbounds_meta + type annotations + union splitting in getidx, no Base.promote_op
  0.980109 seconds (11.35 M allocations: 670.100 MiB, 13.69% gc time, 99.63% compilation time)
  1.745 ms (0 allocations: 0 bytes)

@ncall with Base.@_propagate_inbounds_meta + type annotations + union splitting in getidx, no Base.promote_op, "union-splitting" in `get_struct`
  0.805466 seconds (8.31 M allocations: 488.659 MiB, 8.03% gc time, 99.56% compilation time)
  1.780 ms (0 allocations: 0 bytes)

@ncall with Base.@_propagate_inbounds_meta + type annotations + union splitting in getidx, no Base.promote_op, "union-splitting" in `get_struct` + PrecompileTools
  0.766675 seconds (8.18 M allocations: 479.529 MiB, 7.90% gc time, 99.51% compilation time)
  1.752 ms (0 allocations: 0 bytes)

@ncall without Base.@_propagate_inbounds_meta
  0.593864 seconds (2.53 M allocations: 175.663 MiB, 17.64% gc time, 98.32% compilation time)
  7.158 ms (0 allocations: 0 bytes)

main
  1.828435 seconds (21.47 M allocations: 1.170 GiB, 14.75% gc time, 99.57% compilation time)
  1.874 ms (0 allocations: 0 bytes)

Using SnoopCompile with flamegraphs:

using SnoopCompileCore
using ClimaCore.CommonSpaces
import ClimaCore
import ClimaCore: Fields, Geometry, Operators, Spaces
import LinearAlgebra: dot

space = ExtrudedCubedSphereSpace(; z_elem = 10, z_min = 0, z_max = 1, radius = 10, h_elem = 10, n_quad_points = 4, staggering = CellCenter(), );

ᶜuₕ = similar(zeros(space), Geometry.Covariant12Vector{Float64});
ᶠu₃ = similar(zeros(Spaces.face_space(space)), Geometry.Covariant3Vector{Float64});
ᶠu³ = similar(zeros(Spaces.face_space(space)), Geometry.Contravariant3Vector{Float64});
ᶠuₕ³ = similar(zeros(Spaces.face_space(space)), Geometry.Contravariant3Vector{Float64});
ᶜu = similar(zeros(space), Geometry.Covariant123Vector{Float64});
fill!(parent(ᶜuₕ), 0);
fill!(parent(ᶠu₃), 0);
fill!(parent(ᶠuₕ³), 0);
fill!(parent(ᶜu), 0);
fill!(parent(ᶠu³), 0);
ᶜK = zeros(space);

# Warm InterpolateF2C up
_ = @. Operators.InterpolateF2C()(Geometry.Covariant123Vector(ᶠu₃));

tinf = @snoop_inference begin

    function mytest(ᶜK, ᶜu, ᶠuₕ³, ᶠu³, ᶜuₕ, ᶠu₃)
        interp_f2c = Operators.InterpolateF2C()
        @. ᶜu = Geometry.Covariant123Vector(ᶜuₕ) + interp_f2c(Geometry.Covariant123Vector(ᶠu₃))
        @. ᶠu³ = ᶠuₕ³ + Geometry.Contravariant3Vector(ᶠu₃)
        @. ᶜK = 1 / 2 * (
            dot(Geometry.Covariant123Vector(ᶜuₕ), Geometry.Contravariant123Vector(ᶜuₕ)) +
            interp_f2c(dot(Geometry.Covariant123Vector(ᶠu₃), Geometry.Contravariant123Vector(ᶠu₃))) +
            2 * dot(Geometry.Contravariant123Vector(ᶜuₕ), interp_f2c(Geometry.Covariant123Vector(ᶠu₃)))
        )
        return nothing
    end
    mytest(ᶜK, ᶜu, ᶠuₕ³, ᶠu³, ᶜuₕ, ᶠu₃)

end;

fg = flamegraph(tinf)
using ProfileView
ProfileView.view(fg)

charleskawczynski · 2025-03-17T18:13:20Z

Main branch:

This branch:

Some important points are:

The big purple and green blocks are now gone, via PrecompileTools for the Geometry module.
manually calling bc.f, instead of call_bc_f, means fewer methods need compiled, reducing the number of recursion levels
using Union-splitting reduced the number of getidx methods that need compiled
Type annotations have reduced the size of the top platforms in stencil_interior

charleskawczynski · 2025-03-17T18:14:12Z

On Julia 1.11:

# main
  7.758546 seconds (105.21 M allocations: 4.691 GiB, 27.90% gc time, 99.69% compilation time)
  1.794 ms (0 allocations: 0 bytes)
# This branch
  2.511549 seconds (42.33 M allocations: 1.861 GiB, 24.26% gc time, 99.73% compilation time)
  1.646 ms (0 allocations: 0 bytes)

charleskawczynski · 2025-03-18T15:50:16Z

Unfortunately, we cannot type-hint on getidx calls and also allow passing integer data into boundary conditions. For example:

const ᶜadvdivᵥ = Operators.DivergenceF2C(
    bottom = Operators.SetValue(CT3(0)),
    top = Operators.SetValue(CT3(0)),
)

This pattern costs us ~10-20% latency on our broadcast (I think all?) expressions.

Since fixing this would be a breaking change, I'm going to remove this part.

charleskawczynski · 2025-03-19T17:47:45Z

Running this in ClimaAtmos:

import ClimaComms
ClimaComms.@import_required_backends
import ClimaAtmos as CA
import SciMLBase
import Random
Random.seed!(1234)
empty!(ARGS);
push!(ARGS, "--config_file", "config/model_configs/diagnostic_edmfx_trmm_stretched_box.yml");
push!(ARGS, "--job_id", "diagnostic_edmfx_trmm_stretched_box");
@time begin
    (; config_file, job_id) = CA.commandline_kwargs();
    config = CA.AtmosConfig(config_file; job_id);
    simulation = CA.AtmosSimulation(config);
    (; integrator) = simulation;
    SciMLBase.step!(integrator)
end

Yields:

Main branch:
451.458446 seconds (2.59 G allocations: 121.090 GiB, 38.70% gc time, 99.44% compilation time: <1% of which was recompilation)

This branch:
346.996061 seconds (2.07 G allocations: 94.856 GiB, 29.27% gc time, 99.72% compilation time)

So, this appears to be showing benefits in a larger context, in atmos.

dennisYatunin

It's unfortunate that we can't abstract out any of these changes into UnrolledUtilities without hurting latency, but I expect that by further simplifying the broadcasting and stencil code we can avoid triggering compilation heuristics through more layers of inlining. Until then, looks like we're stuck with manually doing inlining for the compiler. As long as these shenanigans remain isolated to struct.jl and finitedifference.jl, I'm happy with the changes here.

charleskawczynski requested review from Sbozzolo, dennisYatunin, imreddyTeja, juliasloan25, nefrathenrici and sriharshakandala March 10, 2025 18:29

charleskawczynski force-pushed the ck/getidx_args_latency branch from 3634c5e to 89c5e57 Compare March 11, 2025 01:30

charleskawczynski mentioned this pull request Mar 11, 2025

Non-determinism in examples, traced back to initial implicit residual in ImplicitEquationJacobian #2228

Open

charleskawczynski force-pushed the ck/getidx_args_latency branch 3 times, most recently from 0f2289b to 235eae6 Compare March 12, 2025 18:07

charleskawczynski changed the title ~~Use Base.Cartesian to reduce latency in getidx_args~~ Attempts to reduce latency Mar 17, 2025

charleskawczynski force-pushed the ck/getidx_args_latency branch 6 times, most recently from d59ac82 to e6ee4d6 Compare March 17, 2025 19:47

dennisYatunin mentioned this pull request Mar 17, 2025

Add unrolled_mapcall function CliMA/UnrolledUtilities.jl#26

Draft

charleskawczynski force-pushed the ck/getidx_args_latency branch from e6ee4d6 to 0363ae8 Compare March 18, 2025 15:50

charleskawczynski force-pushed the ck/getidx_args_latency branch 2 times, most recently from 36249bc to ca4aa4e Compare March 19, 2025 17:25

charleskawczynski force-pushed the ck/getidx_args_latency branch from ca4aa4e to f0d92c1 Compare March 19, 2025 17:59

charleskawczynski mentioned this pull request Mar 19, 2025

Add PrecompileTools for Geometry module #2249

Open

Reduce latency in stencil operators

69ff0d9

charleskawczynski force-pushed the ck/getidx_args_latency branch from f0d92c1 to 69ff0d9 Compare March 19, 2025 18:49

dennisYatunin approved these changes Mar 19, 2025

View reviewed changes

charleskawczynski merged commit 50b1dd1 into main Mar 19, 2025
31 of 34 checks passed

charleskawczynski deleted the ck/getidx_args_latency branch March 19, 2025 20:07

charleskawczynski mentioned this pull request Mar 19, 2025

Add user-documentation for isprecompilable JuliaDebug/SnoopCompile.jl#411

Closed

charleskawczynski added the Latency label May 24, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Attempts to reduce latency #2225

Attempts to reduce latency #2225

Uh oh!

charleskawczynski commented Mar 10, 2025 •

edited

Loading

Uh oh!

charleskawczynski commented Mar 17, 2025

Uh oh!

charleskawczynski commented Mar 17, 2025

Uh oh!

charleskawczynski commented Mar 18, 2025

Uh oh!

charleskawczynski commented Mar 19, 2025

Uh oh!

dennisYatunin left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Attempts to reduce latency #2225

Attempts to reduce latency #2225

Uh oh!

Conversation

charleskawczynski commented Mar 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

charleskawczynski commented Mar 17, 2025

Uh oh!

charleskawczynski commented Mar 17, 2025

Uh oh!

charleskawczynski commented Mar 18, 2025

Uh oh!

charleskawczynski commented Mar 19, 2025

Uh oh!

dennisYatunin left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

charleskawczynski commented Mar 10, 2025 •

edited

Loading