Skip to content

Conversation

@charleskawczynski
Copy link
Member

@charleskawczynski charleskawczynski commented Mar 10, 2025

This PR is an attempt to reduce latency. This was motivated by the example in #2215.

using SnoopCompileCore
using ClimaCore.CommonSpaces
import ClimaCore
import ClimaCore: Fields, Geometry, Operators, Spaces
import LinearAlgebra: dot

space = ExtrudedCubedSphereSpace(; z_elem = 10, z_min = 0, z_max = 1, radius = 10, h_elem = 10, n_quad_points = 4, staggering = CellCenter(), );

ᶜuₕ = similar(zeros(space), Geometry.Covariant12Vector{Float64});
ᶠu₃ = similar(zeros(Spaces.face_space(space)), Geometry.Covariant3Vector{Float64});
ᶠu³ = similar(zeros(Spaces.face_space(space)), Geometry.Contravariant3Vector{Float64});
ᶠuₕ³ = similar(zeros(Spaces.face_space(space)), Geometry.Contravariant3Vector{Float64});
ᶜu = similar(zeros(space), Geometry.Covariant123Vector{Float64});
fill!(parent(ᶜuₕ), 0);
fill!(parent(ᶠu₃), 0);
fill!(parent(ᶠuₕ³), 0);
fill!(parent(ᶜu), 0);
fill!(parent(ᶠu³), 0);
ᶜK = zeros(space);

# Warm InterpolateF2C up
_ = @. Operators.InterpolateF2C()(Geometry.Covariant123Vector(ᶠu₃));


function mytest(ᶜK, ᶜu, ᶠuₕ³, ᶠu³, ᶜuₕ, ᶠu₃)
    interp_f2c = Operators.InterpolateF2C()
    @. ᶜu = Geometry.Covariant123Vector(ᶜuₕ) + interp_f2c(Geometry.Covariant123Vector(ᶠu₃))
    @. ᶠu³ = ᶠuₕ³ + Geometry.Contravariant3Vector(ᶠu₃)
    @. ᶜK = 1 / 2 * (
        dot(Geometry.Covariant123Vector(ᶜuₕ), Geometry.Contravariant123Vector(ᶜuₕ)) +
        interp_f2c(dot(Geometry.Covariant123Vector(ᶠu₃), Geometry.Contravariant123Vector(ᶠu₃))) +
        2 * dot(Geometry.Contravariant123Vector(ᶜuₕ), interp_f2c(Geometry.Covariant123Vector(ᶠu₃)))
    )
    return nothing
end
@time mytest(ᶜK, ᶜu, ᶠuₕ³, ᶠu³, ᶜuₕ, ᶠu₃)

Several experiments were attempted, some were found to be fruitful. These experiments were performed with Julia 1.10:

UnrolledUtilities.unroll_map (naive)
  0.797456 seconds (3.07 M allocations: 220.475 MiB, 23.52% gc time, 94.72% compilation time)
  37.969 ms (0 allocations: 0 bytes)

UnrolledUtilities.unroll_map with Base.@_inline_meta
  3.422859 seconds (46.47 M allocations: 2.352 GiB, 22.98% gc time, 99.61% compilation time)
  1.853 ms (0 allocations: 0 bytes)

UnrolledUtilities.unroll_map with Base.@_propagate_inbounds_meta
  3.190773 seconds (46.47 M allocations: 2.352 GiB, 21.57% gc time, 99.76% compilation time)
  1.877 ms (0 allocations: 0 bytes)

UnrolledUtilities.unrolled_map_into_tuple (naive)
  0.669018 seconds (2.79 M allocations: 199.967 MiB, 14.59% gc time, 93.86% compilation time)
  38.014 ms (0 allocations: 0 bytes)

UnrolledUtilities.unrolled_map_into_tuple with Base.@_propagate_inbounds_meta
  2.343249 seconds (34.25 M allocations: 1.761 GiB, 16.47% gc time, 99.60% compilation time)
  1.900 ms (0 allocations: 0 bytes)

UnrolledUtilities.unrolled_map_into_tuple with Base.@_inline_meta
  2.629591 seconds (34.25 M allocations: 1.761 GiB, 15.49% gc time, 99.73% compilation time)
  1.861 ms (0 allocations: 0 bytes)

@ncall with Base.@_propagate_inbounds_meta
  1.722072 seconds (17.49 M allocations: 981.042 MiB, 19.83% gc time, 98.70% compilation time)
  1.904 ms (0 allocations: 0 bytes)

@ncall with Base.@_propagate_inbounds_meta + type annotations
  1.340580 seconds (17.50 M allocations: 980.648 MiB, 16.88% gc time, 99.72% compilation time)
  1.741 ms (0 allocations: 0 bytes)

@ncall with Base.@_propagate_inbounds_meta + type annotations + mygetidx
  1.715542 seconds (21.53 M allocations: 1.127 GiB, 16.88% gc time, 99.31% compilation time)
  1.744 ms (0 allocations: 0 bytes)

@ncall with Base.@_propagate_inbounds_meta + type annotations + mygetidx + (mygetidx type annotations)
  1.553747 seconds (21.54 M allocations: 1.128 GiB, 14.65% gc time, 99.16% compilation time)
  1.742 ms (0 allocations: 0 bytes)

@ncall with Base.@_propagate_inbounds_meta + type annotations + union splitting in getidx
  1.018555 seconds (11.33 M allocations: 677.195 MiB, 16.38% gc time, 99.62% compilation time)
  1.747 ms (0 allocations: 0 bytes)

@ncall with Base.@_propagate_inbounds_meta + type annotations + union splitting in getidx, no Base.promote_op
  0.980109 seconds (11.35 M allocations: 670.100 MiB, 13.69% gc time, 99.63% compilation time)
  1.745 ms (0 allocations: 0 bytes)

@ncall with Base.@_propagate_inbounds_meta + type annotations + union splitting in getidx, no Base.promote_op, "union-splitting" in `get_struct`
  0.805466 seconds (8.31 M allocations: 488.659 MiB, 8.03% gc time, 99.56% compilation time)
  1.780 ms (0 allocations: 0 bytes)

@ncall with Base.@_propagate_inbounds_meta + type annotations + union splitting in getidx, no Base.promote_op, "union-splitting" in `get_struct` + PrecompileTools
  0.766675 seconds (8.18 M allocations: 479.529 MiB, 7.90% gc time, 99.51% compilation time)
  1.752 ms (0 allocations: 0 bytes)

@ncall without Base.@_propagate_inbounds_meta
  0.593864 seconds (2.53 M allocations: 175.663 MiB, 17.64% gc time, 98.32% compilation time)
  7.158 ms (0 allocations: 0 bytes)

main
  1.828435 seconds (21.47 M allocations: 1.170 GiB, 14.75% gc time, 99.57% compilation time)
  1.874 ms (0 allocations: 0 bytes)

Using SnoopCompile with flamegraphs:

using SnoopCompileCore
using ClimaCore.CommonSpaces
import ClimaCore
import ClimaCore: Fields, Geometry, Operators, Spaces
import LinearAlgebra: dot

space = ExtrudedCubedSphereSpace(; z_elem = 10, z_min = 0, z_max = 1, radius = 10, h_elem = 10, n_quad_points = 4, staggering = CellCenter(), );

ᶜuₕ = similar(zeros(space), Geometry.Covariant12Vector{Float64});
ᶠu₃ = similar(zeros(Spaces.face_space(space)), Geometry.Covariant3Vector{Float64});
ᶠu³ = similar(zeros(Spaces.face_space(space)), Geometry.Contravariant3Vector{Float64});
ᶠuₕ³ = similar(zeros(Spaces.face_space(space)), Geometry.Contravariant3Vector{Float64});
ᶜu = similar(zeros(space), Geometry.Covariant123Vector{Float64});
fill!(parent(ᶜuₕ), 0);
fill!(parent(ᶠu₃), 0);
fill!(parent(ᶠuₕ³), 0);
fill!(parent(ᶜu), 0);
fill!(parent(ᶠu³), 0);
ᶜK = zeros(space);

# Warm InterpolateF2C up
_ = @. Operators.InterpolateF2C()(Geometry.Covariant123Vector(ᶠu₃));

tinf = @snoop_inference begin

    function mytest(ᶜK, ᶜu, ᶠuₕ³, ᶠu³, ᶜuₕ, ᶠu₃)
        interp_f2c = Operators.InterpolateF2C()
        @. ᶜu = Geometry.Covariant123Vector(ᶜuₕ) + interp_f2c(Geometry.Covariant123Vector(ᶠu₃))
        @. ᶠu³ = ᶠuₕ³ + Geometry.Contravariant3Vector(ᶠu₃)
        @. ᶜK = 1 / 2 * (
            dot(Geometry.Covariant123Vector(ᶜuₕ), Geometry.Contravariant123Vector(ᶜuₕ)) +
            interp_f2c(dot(Geometry.Covariant123Vector(ᶠu₃), Geometry.Contravariant123Vector(ᶠu₃))) +
            2 * dot(Geometry.Contravariant123Vector(ᶜuₕ), interp_f2c(Geometry.Covariant123Vector(ᶠu₃)))
        )
        return nothing
    end
    mytest(ᶜK, ᶜu, ᶠuₕ³, ᶠu³, ᶜuₕ, ᶠu₃)

end;

fg = flamegraph(tinf)
using ProfileView
ProfileView.view(fg)

@charleskawczynski charleskawczynski force-pushed the ck/getidx_args_latency branch 3 times, most recently from 0f2289b to 235eae6 Compare March 12, 2025 18:07
@charleskawczynski charleskawczynski changed the title Use Base.Cartesian to reduce latency in getidx_args Attempts to reduce latency Mar 17, 2025
@charleskawczynski
Copy link
Member Author

Main branch:
image

This branch:
image

Some important points are:

  • The big purple and green blocks are now gone, via PrecompileTools for the Geometry module.
  • manually calling bc.f, instead of call_bc_f, means fewer methods need compiled, reducing the number of recursion levels
  • using Union-splitting reduced the number of getidx methods that need compiled
  • Type annotations have reduced the size of the top platforms in stencil_interior

@charleskawczynski
Copy link
Member Author

On Julia 1.11:

# main
  7.758546 seconds (105.21 M allocations: 4.691 GiB, 27.90% gc time, 99.69% compilation time)
  1.794 ms (0 allocations: 0 bytes)
# This branch
  2.511549 seconds (42.33 M allocations: 1.861 GiB, 24.26% gc time, 99.73% compilation time)
  1.646 ms (0 allocations: 0 bytes)

@charleskawczynski
Copy link
Member Author

Unfortunately, we cannot type-hint on getidx calls and also allow passing integer data into boundary conditions. For example:

const ᶜadvdivᵥ = Operators.DivergenceF2C(
    bottom = Operators.SetValue(CT3(0)),
    top = Operators.SetValue(CT3(0)),
)

This pattern costs us ~10-20% latency on our broadcast (I think all?) expressions.

Since fixing this would be a breaking change, I'm going to remove this part.

@charleskawczynski charleskawczynski force-pushed the ck/getidx_args_latency branch 2 times, most recently from 36249bc to ca4aa4e Compare March 19, 2025 17:25
@charleskawczynski
Copy link
Member Author

Running this in ClimaAtmos:

import ClimaComms
ClimaComms.@import_required_backends
import ClimaAtmos as CA
import SciMLBase
import Random
Random.seed!(1234)
empty!(ARGS);
push!(ARGS, "--config_file", "config/model_configs/diagnostic_edmfx_trmm_stretched_box.yml");
push!(ARGS, "--job_id", "diagnostic_edmfx_trmm_stretched_box");
@time begin
    (; config_file, job_id) = CA.commandline_kwargs();
    config = CA.AtmosConfig(config_file; job_id);
    simulation = CA.AtmosSimulation(config);
    (; integrator) = simulation;
    SciMLBase.step!(integrator)
end

Yields:

Main branch:
451.458446 seconds (2.59 G allocations: 121.090 GiB, 38.70% gc time, 99.44% compilation time: <1% of which was recompilation)

This branch:
346.996061 seconds (2.07 G allocations: 94.856 GiB, 29.27% gc time, 99.72% compilation time)

So, this appears to be showing benefits in a larger context, in atmos.

Copy link
Member

@dennisYatunin dennisYatunin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unfortunate that we can't abstract out any of these changes into UnrolledUtilities without hurting latency, but I expect that by further simplifying the broadcasting and stencil code we can avoid triggering compilation heuristics through more layers of inlining. Until then, looks like we're stuck with manually doing inlining for the compiler. As long as these shenanigans remain isolated to struct.jl and finitedifference.jl, I'm happy with the changes here.

@charleskawczynski charleskawczynski merged commit 50b1dd1 into main Mar 19, 2025
31 of 34 checks passed
@charleskawczynski charleskawczynski deleted the ck/getidx_args_latency branch March 19, 2025 20:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants