Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compiler support for optimizing PersistentDict #51993

Merged
merged 1 commit into from
Nov 27, 2023
Merged

Compiler support for optimizing PersistentDict #51993

merged 1 commit into from
Nov 27, 2023

Conversation

Keno
Copy link
Member

@Keno Keno commented Nov 2, 2023

This is part of the work to address #51352 by attempting to allow the compiler to perform SRAO on persistent data structures like PersistentDict as if they were regular immutable data structures. These sorts of data structures have very complicated internals (with lots of mutation, memory sharing, etc.), but a relatively simple interface. As such, it is unlikely that our compiler will have sufficient power to optimize this interface by analyzing the implementation.

We thus need to come up with some other mechanism that gives the compiler license to perform the requisite optimization. One way would be to just hardcode PersistentDict into the compiler, optimizing it like any of the other builtin datatypes. However, this is of course very unsatisfying. At the other end of the spectrum would be something like a generic rewrite rule system (e-graphs anyone?) that would let the PersistentDict implementation declare its interface to the compiler and the compiler would use this for optimization (in a perfect world, the actual rewrite would then be checked using some sort of formal methods). I think that would be interesting, but we're very far from even being able to design something like that (at least in Base - experiments with external AbstractInterpreters in this direction are encouraged).

This PR tries to come up with a reasonable middle ground, where the compiler gets some knowledge of the protocol hardcoded without having to know about the implementation details of the data structure.

The basic ideas is that Core provides some magic generic functions that implementations can extend. Semantically, they are not special. They dispatch as usual, and implementations are expected to work properly even in the absence of any compiler optimizations.

However, the compiler is semantically permitted to perform structural optimization using these magic generic functions. In the concrete case, this PR introduces the KeyValue interface which consists of two generic functions, get and set. The core optimization is that the compiler is allowed to rewrite any occurrence of get(set(x, k, v), k) into v without additional legality checks. In particular, the compiler performs no type checks, conversions, etc. The higher level implementation code is expected to do all that.

This approach closely matches the general direction we've been taking in external AbstractInterpreters for embedding additional semantics and optimization opportunities into Julia code (although we generally use methods there, rather than full generic functions), so I think we have some evidence that this sort of approach works reasonably well.

Nevertheless, this is certainly an experiment and the interface is explicitly declared unstable.

Current Status

This is fully working and implemented, but the optimization currently bails on anything but the simplest cases. Filling all those cases in is not particularly hard, but should be done along with a more invasive refactoring of SROA, so we should figure out the general direction here first and then we can finish all that up in a follow-up cleanup.

Obligatory benchmark

Before:

julia> using BenchmarkTools

julia> function foo()
           a = Base.PersistentDict(:a => 1)
           return a[:a]
       end
foo (generic function with 1 method)

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  32.940 ns …  28.754 μs  ┊ GC (min … max):  0.00% … 99.76%
 Time  (median):     49.647 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   57.519 ns ± 333.275 ns  ┊ GC (mean ± σ):  10.81% ±  2.22%

        ▃█▅               ▁▃▅▅▃▁                ▁▃▂   ▂
  ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃
  32.9 ns         Histogram: frequency by time         68.6 ns <

 Memory estimate: 128 bytes, allocs estimate: 4.

julia> @code_typed foo()
CodeInfo(
1 ─ %1  = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %2  = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64}
│   %3  = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64}
│   %4  = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %5  = $(Expr(:boundscheck, true))::Bool
└──       goto #5 if not %5
2 ─ %7  = Base.sub_int(1, 1)::Int64
│   %8  = Base.bitcast(UInt64, %7)::UInt64
│   %9  = Base.getfield(%4, :size)::Tuple{Int64}
│   %10 = $(Expr(:boundscheck, true))::Bool
│   %11 = Base.getfield(%9, 1, %10)::Int64
│   %12 = Base.bitcast(UInt64, %11)::UInt64
│   %13 = Base.ult_int(%8, %12)::Bool
└──       goto #4 if not %13
3 ─       goto #5
4 ─ %16 = Core.tuple(1)::Tuple{Int64}
│         invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{}
└──       unreachable
5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│         Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
└──       goto #6
6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32
│   %24 = Base.or_int(%23, 0x00010000)::UInt32
│         Base.setfield!(%2, :bitmap, %24)::UInt32
└──       goto #7
7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64}
└──       goto #8
8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64}, :a::Symbol)::Int64
└──       return %29

After:

julia> using BenchmarkTools

julia> function foo()
           a = Base.PersistentDict(:a => 1)
           return a[:a]
       end
foo (generic function with 1 method)

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.459 ns … 11.320 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.460 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.469 ns ±  0.183 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂    █                                              ▁    █ ▂
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █
  2.46 ns      Histogram: log(frequency) by time     2.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @code_typed foo()
CodeInfo(
1 ─     return 1

@JeffBezanson JeffBezanson added the compiler:optimizer Optimization passes (mostly in base/compiler/ssair/) label Nov 2, 2023
@JeffBezanson
Copy link
Member

LGTM

@vchuravy
Copy link
Member

vchuravy commented Nov 2, 2023

this is really cool. A little bit scary but I prefer it over moving the definition into the compiler wholesale.

@topolarity
Copy link
Member

topolarity commented Nov 7, 2023

What are the consequences of actions that break interface assumptions?

julia> dict = Base.PersistentDict(0=>false)
Base.PersistentDict{Int64, Bool} with 1 entry:
  0 => 0

julia> dict.trie.data[1] = Base.HashArrayMappedTries.Leaf{Int64, Bool}(0, true)
Base.HashArrayMappedTries.Leaf{Int64, Bool}(0, true)

julia> dict
Base.PersistentDict{Int64, Bool} with 1 entry:
  0 => 1

Obviously a user "shouldn't" do this, but would this now trigger UB?

@Keno
Copy link
Member Author

Keno commented Nov 9, 2023

Obviously a user "shouldn't" do this, but would this now trigger UB?

Yes, any misuse is undefined behavior. There are some basic sanity checks that avoid applying the optimization if you things are obviously wrong, but we can't really say anything else.

@Keno
Copy link
Member Author

Keno commented Nov 9, 2023

I've given this some further thought, and I don't think the callback design I have here is all that good. I'm planning to rejigger this with an optional return for the getter, having it return Union{Nothing, Tuple{V}}. That currently has some performance implications, because in general a tuple return with arbitrary values is a bad idea (there's a reason we turn this into a struct all over the compiler), but we can fix that with a better ABI, which I'll put up as a separate PR.

#length(stmt.args) == 4 || return

collection = stmt.args[end-1]
key = stmt.args[end]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need some kind of a mutability check on key, since PersistentDict allows mutable keys but does not behave very sensibly for them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you expand what you mean by doesn't behave sensible? Or open a separate issue?

Copy link
Member

@topolarity topolarity Nov 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main problem for this optimization is:

julia> x = Int64[0];

julia> d = Base.PersistentDict(x => false, [1] => true);

julia> x[1] = 1

julia> d[x]
true

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could make it an IdDict equivalent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That still leaves types like IdDict out in the cold, since it is an immutable struct, that references mutable contents to compute the hash. There is isidentityfree that solves that issue though, if desired.

leaves::Vector{Any}, 𝕃ₒ::AbstractLattice)
# For every leaf, the lifted value
lifted_leaves = LiftedLeaves()
maybe_undef = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
maybe_undef = false

Comment on lines 879 to 881
lifted_result = lift_leaves_keyvalue(compact, key, leaves, 𝕃ₒ)
lifted_result === nothing && return
lifted_leaves, any_undef = lifted_result
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
lifted_result = lift_leaves_keyvalue(compact, key, leaves, 𝕃ₒ)
lifted_result === nothing && return
lifted_leaves, any_undef = lifted_result
lifted_leaves = lift_leaves_keyvalue(compact, key, leaves, 𝕃ₒ)
lifted_leaves === nothing && return

@aviatesk
Copy link
Member

I don't think the callback design I have here is all that good. I'm planning to rejigger this with an optional return for the getter, having it return Union{Nothing, Tuple{V}}

What is the main reason for this? Is it because it makes debugging difficult when users are developing code to utilize this interface?
I'm a bit concerned about the performance impact of @noinlines that are necessary for SROA to recognize these interface primitives in cases when SROA can't be performed.

@aviatesk
Copy link
Member

aviatesk commented Nov 16, 2023

Rebased against the latest master.

@Keno
Copy link
Member Author

Keno commented Nov 18, 2023

What is the main reason for this?

The callbacks are a bit awkward, since it's not clear when to infer them, if you want to optimize them.

@Keno
Copy link
Member Author

Keno commented Nov 20, 2023

Review addressed, rebased, and updated in line with what I said I would do above. Needs #52193 to be sound, but other than that, I don't see much reason not to try this. The one remaining concern is that because of the @noinline the performance is slightly worse for loosely typed ScopedValues, requiring an extra allocation. However, as indicated above, that can definitely be addressed with some codegen ABI improvements. That said, I'm not planning on finishing those up immediately unless it becomes a problem - I think there's bigger fish to fry elsewhere.

Keno added a commit that referenced this pull request Nov 26, 2023
As discussed in #51352, this gives `EnterNode` the ability to set
(and restore on leave or catch edge) jl_current_task->scope. Manual
modifications of the task field after the task has started are
considered undefined behavior. In addition, we gain a new intrinsic
to access current_task->scope and both inference and the optimizer
will forward scopes from EnterNodes to this intrinsic (non-interprocedurally).
Together with #51993 this is sufficient to fully optimize ScopedValues
(non-interprocedurally at least).
Keno added a commit that referenced this pull request Nov 26, 2023
As discussed in #51352, this gives `EnterNode` the ability to set
(and restore on leave or catch edge) jl_current_task->scope. Manual
modifications of the task field after the task has started are
considered undefined behavior. In addition, we gain a new intrinsic
to access current_task->scope and both inference and the optimizer
will forward scopes from EnterNodes to this intrinsic (non-interprocedurally).
Together with #51993 this is sufficient to fully optimize ScopedValues
(non-interprocedurally at least).
This is part of the work to address #51352 by attempting to allow
the compiler to perform SRAO on persistent data structures like
`PersistentDict` as if they were regular immutable data structures.
These sorts of data structures have very complicated internals
(with lots of mutation, memory sharing, etc.), but a relatively
simple interface. As such, it is unlikely that our compiler will
have sufficient power to optimize this interface by analyzing
the implementation.

We thus need to come up with some other mechanism that gives the
compiler license to perform the requisite optimization. One way
would be to just hardcode `PersistentDict` into the compiler,
optimizing it like any of the other builtin datatypes. However,
this is of course very unsatisfying. At the other end of the
spectrum would be something like a generic rewrite rule system
(e-graphs anyone?) that would let the PersistentDict
implementation declare its interface to the compiler and the
compiler would use this for optimization (in a perfect world,
the actual rewrite would then be checked using some sort of
formal methods). I think that would be interesting, but we're
very far from even being able to design something like that
(at least in Base - experiments with external AbstractInterpreters
in this direction are encouraged).

This PR tries to come up with a reasonable middle ground, where
the compiler gets some knowledge of the protocol hardcoded without
having to know about the implementation details of the data structure.

The basic ideas is that `Core` provides some magic generic functions
that implementations can extend. Semantically, they are not special.
They dispatch as usual, and implementations are expected to work
properly even in the absence of any compiler optimizations.

However, the compiler is semantically permitted to perform structural
optimization using these magic generic functions. In the concrete
case, this PR introduces the `KeyValue` interface which consists
of two generic functions, `get` and `set`. The core optimization
is that the compiler is allowed to rewrite any occurrence of
`get(set(x, k, v), k)` into `v` without additional legality checks.
In particular, the compiler performs no type checks, conversions, etc.
The higher level implementation code is expected to do all that.

This approach closely matches the general direction we've been taking
in external AbstractInterpreters for embedding additional semantics
and optimization opportunities into Julia code (although we generally
use methods there, rather than full generic functions), so I think
we have some evidence that this sort of approach works reasonably well.

Nevertheless, this is certainly an experiment and the interface is
explicitly declared unstable.

This is fully working and implemented, but the optimization currently
bails on anything but the simplest cases. Filling all those cases in
is not particularly hard, but should be done along with a more invasive
refactoring of SROA, so we should figure out the general direction
here first and then we can finish all that up in a follow-up cleanup.

Before:
```
julia> using BenchmarkTools

julia> function foo()
           a = Base.PersistentDict(:a => 1)
           return a[:a]
       end
foo (generic function with 1 method)

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  32.940 ns …  28.754 μs  ┊ GC (min … max):  0.00% … 99.76%
 Time  (median):     49.647 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   57.519 ns ± 333.275 ns  ┊ GC (mean ± σ):  10.81% ±  2.22%

        ▃█▅               ▁▃▅▅▃▁                ▁▃▂   ▂
  ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃
  32.9 ns         Histogram: frequency by time         68.6 ns <

 Memory estimate: 128 bytes, allocs estimate: 4.

julia> @code_typed foo()
CodeInfo(
1 ─ %1  = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %2  = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64}
│   %3  = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64}
│   %4  = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %5  = $(Expr(:boundscheck, true))::Bool
└──       goto #5 if not %5
2 ─ %7  = Base.sub_int(1, 1)::Int64
│   %8  = Base.bitcast(UInt64, %7)::UInt64
│   %9  = Base.getfield(%4, :size)::Tuple{Int64}
│   %10 = $(Expr(:boundscheck, true))::Bool
│   %11 = Base.getfield(%9, 1, %10)::Int64
│   %12 = Base.bitcast(UInt64, %11)::UInt64
│   %13 = Base.ult_int(%8, %12)::Bool
└──       goto #4 if not %13
3 ─       goto #5
4 ─ %16 = Core.tuple(1)::Tuple{Int64}
│         invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{}
└──       unreachable
5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│         Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
└──       goto #6
6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32
│   %24 = Base.or_int(%23, 0x00010000)::UInt32
│         Base.setfield!(%2, :bitmap, %24)::UInt32
└──       goto #7
7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64}
└──       goto #8
8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64}, 🅰️:Symbol)::Int64
└──       return %29
```

After:
```
julia> using BenchmarkTools

julia> function foo()
           a = Base.PersistentDict(:a => 1)
           return a[:a]
       end
foo (generic function with 1 method)

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.459 ns … 11.320 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.460 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.469 ns ±  0.183 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂    █                                              ▁    █ ▂
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █
  2.46 ns      Histogram: log(frequency) by time     2.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @code_typed foo()
CodeInfo(
1 ─     return 1
```
@Keno Keno changed the title RFC: Compiler support for optimizing PersistentDict Compiler support for optimizing PersistentDict Nov 26, 2023
@Keno
Copy link
Member Author

Keno commented Nov 26, 2023

Rebased. I think this is basically good to go. I'm gonna do some more experiments with this, but those might go into a follow up PR, depending.

@Keno
Copy link
Member Author

Keno commented Nov 27, 2023

Working well in my downstream testing. It does still not fully optimize ScopedValue, because we do not yet have the ability to remove all the setup code, which will need some effect tweaks, but that's independent of this, so I'll go ahead and merge this.

@Keno Keno merged commit 5b2fcb6 into master Nov 27, 2023
8 checks passed
@Keno Keno deleted the kf/dictopt branch November 27, 2023 17:01
Keno added a commit that referenced this pull request Nov 28, 2023
As discussed in #51352, this gives `EnterNode` the ability to set
(and restore on leave or catch edge) jl_current_task->scope. Manual
modifications of the task field after the task has started are
considered undefined behavior. In addition, we gain a new intrinsic
to access current_task->scope and both inference and the optimizer
will forward scopes from EnterNodes to this intrinsic (non-interprocedurally).
Together with #51993 this is sufficient to fully optimize ScopedValues
(non-interprocedurally at least).
mkitti pushed a commit to mkitti/julia that referenced this pull request Dec 9, 2023
This is part of the work to address JuliaLang#51352 by attempting to allow the
compiler to perform SRAO on persistent data structures like
`PersistentDict` as if they were regular immutable data structures.
These sorts of data structures have very complicated internals (with
lots of mutation, memory sharing, etc.), but a relatively simple
interface. As such, it is unlikely that our compiler will have
sufficient power to optimize this interface by analyzing the
implementation.

We thus need to come up with some other mechanism that gives the
compiler license to perform the requisite optimization. One way would be
to just hardcode `PersistentDict` into the compiler, optimizing it like
any of the other builtin datatypes. However, this is of course very
unsatisfying. At the other end of the spectrum would be something like a
generic rewrite rule system (e-graphs anyone?) that would let the
PersistentDict implementation declare its interface to the compiler and
the compiler would use this for optimization (in a perfect world, the
actual rewrite would then be checked using some sort of formal methods).
I think that would be interesting, but we're very far from even being
able to design something like that (at least in Base - experiments with
external AbstractInterpreters in this direction are encouraged).

This PR tries to come up with a reasonable middle ground, where the
compiler gets some knowledge of the protocol hardcoded without having to
know about the implementation details of the data structure.

The basic ideas is that `Core` provides some magic generic functions
that implementations can extend. Semantically, they are not special.
They dispatch as usual, and implementations are expected to work
properly even in the absence of any compiler optimizations.

However, the compiler is semantically permitted to perform structural
optimization using these magic generic functions. In the concrete case,
this PR introduces the `KeyValue` interface which consists of two
generic functions, `get` and `set`. The core optimization is that the
compiler is allowed to rewrite any occurrence of `get(set(x, k, v), k)`
into `v` without additional legality checks. In particular, the compiler
performs no type checks, conversions, etc. The higher level
implementation code is expected to do all that.

This approach closely matches the general direction we've been taking in
external AbstractInterpreters for embedding additional semantics and
optimization opportunities into Julia code (although we generally use
methods there, rather than full generic functions), so I think we have
some evidence that this sort of approach works reasonably well.

Nevertheless, this is certainly an experiment and the interface is
explicitly declared unstable.

## Current Status

This is fully working and implemented, but the optimization currently
bails on anything but the simplest cases. Filling all those cases in is
not particularly hard, but should be done along with a more invasive
refactoring of SROA, so we should figure out the general direction here
first and then we can finish all that up in a follow-up cleanup.

## Obligatory benchmark
Before:
```
julia> using BenchmarkTools

julia> function foo()
           a = Base.PersistentDict(:a => 1)
           return a[:a]
       end
foo (generic function with 1 method)

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 993 evaluations.
 Range (min … max):  32.940 ns …  28.754 μs  ┊ GC (min … max):  0.00% … 99.76%
 Time  (median):     49.647 ns               ┊ GC (median):     0.00%
 Time  (mean ± σ):   57.519 ns ± 333.275 ns  ┊ GC (mean ± σ):  10.81% ±  2.22%

        ▃█▅               ▁▃▅▅▃▁                ▁▃▂   ▂
  ▁▂▄▃▅▇███▇▃▁▂▁▁▁▁▁▁▁▁▂▂▅██████▅▂▁▁▁▁▁▁▁▁▁▁▂▃▃▇███▇▆███▆▄▃▃▂▂ ▃
  32.9 ns         Histogram: frequency by time         68.6 ns <

 Memory estimate: 128 bytes, allocs estimate: 4.

julia> @code_typed foo()
CodeInfo(
1 ─ %1  = invoke Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}(Base.HashArrayMappedTries.undef::UndefInitializer, 1::Int64)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %2  = %new(Base.HashArrayMappedTries.HAMT{Symbol, Int64}, %1, 0x00000000)::Base.HashArrayMappedTries.HAMT{Symbol, Int64}
│   %3  = %new(Base.HashArrayMappedTries.Leaf{Symbol, Int64}, :a, 1)::Base.HashArrayMappedTries.Leaf{Symbol, Int64}
│   %4  = Base.getfield(%2, :data)::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %5  = $(Expr(:boundscheck, true))::Bool
└──       goto JuliaLang#5 if not %5
2 ─ %7  = Base.sub_int(1, 1)::Int64
│   %8  = Base.bitcast(UInt64, %7)::UInt64
│   %9  = Base.getfield(%4, :size)::Tuple{Int64}
│   %10 = $(Expr(:boundscheck, true))::Bool
│   %11 = Base.getfield(%9, 1, %10)::Int64
│   %12 = Base.bitcast(UInt64, %11)::UInt64
│   %13 = Base.ult_int(%8, %12)::Bool
└──       goto JuliaLang#4 if not %13
3 ─       goto JuliaLang#5
4 ─ %16 = Core.tuple(1)::Tuple{Int64}
│         invoke Base.throw_boundserror(%4::Vector{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}, %16::Tuple{Int64})::Union{}
└──       unreachable
5 ┄ %19 = Base.getfield(%4, :ref)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│   %20 = Base.memoryref(%19, 1, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
│         Base.memoryrefset!(%20, %3, :not_atomic, false)::MemoryRef{Union{Base.HashArrayMappedTries.HAMT{Symbol, Int64}, Base.HashArrayMappedTries.Leaf{Symbol, Int64}}}
└──       goto JuliaLang#6
6 ─ %23 = Base.getfield(%2, :bitmap)::UInt32
│   %24 = Base.or_int(%23, 0x00010000)::UInt32
│         Base.setfield!(%2, :bitmap, %24)::UInt32
└──       goto JuliaLang#7
7 ─ %27 = %new(Base.PersistentDict{Symbol, Int64}, %2)::Base.PersistentDict{Symbol, Int64}
└──       goto JuliaLang#8
8 ─ %29 = invoke Base.getindex(%27::Base.PersistentDict{Symbol, Int64}, 🅰️:Symbol)::Int64
└──       return %29
```

After:
```
julia> using BenchmarkTools

julia> function foo()
           a = Base.PersistentDict(:a => 1)
           return a[:a]
       end
foo (generic function with 1 method)

julia> @benchmark foo()
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  2.459 ns … 11.320 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.460 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.469 ns ±  0.183 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▂    █                                              ▁    █ ▂
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁█ █
  2.46 ns      Histogram: log(frequency) by time     2.47 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @code_typed foo()
CodeInfo(
1 ─     return 1
```
Keno added a commit that referenced this pull request Dec 11, 2023
As discussed in #51352, this gives `EnterNode` the ability to set
(and restore on leave or catch edge) jl_current_task->scope. Manual
modifications of the task field after the task has started are
considered undefined behavior. In addition, we gain a new intrinsic
to access current_task->scope and both inference and the optimizer
will forward scopes from EnterNodes to this intrinsic (non-interprocedurally).
Together with #51993 this is sufficient to fully optimize ScopedValues
(non-interprocedurally at least).
Keno added a commit that referenced this pull request Dec 13, 2023
As discussed in #51352, this gives `EnterNode` the ability to set (and
restore on leave or catch edge) jl_current_task->scope. Manual
modifications of the task field after the task has started are
considered undefined behavior. In addition, we gain a new intrinsic to
access current_task->scope and both inference and the optimizer will
forward scopes from EnterNodes to this intrinsic
(non-interprocedurally). Together with #51993 this is sufficient to
fully optimize ScopedValues (non-interprocedurally at least).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:optimizer Optimization passes (mostly in base/compiler/ssair/)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants