Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make dataids and mightalias API #51753

Open
jakobnissen opened this issue Oct 18, 2023 · 5 comments
Open

Make dataids and mightalias API #51753

jakobnissen opened this issue Oct 18, 2023 · 5 comments
Labels
arrays [a, r, r, a, y, s] broadcast Applying a function over a collection design Design of APIs or of the language itself feature Indicates new feature / enhancement requests performance Must go faster

Comments

@jakobnissen
Copy link
Contributor

Famously, in the Julia docs, it states:

The only interfaces that are stable with respect to SemVer of julia version are the Julia Base and standard libraries interfaces described in the documentation and not marked as unstable (e.g., experimental and internal)

Base.dataids and Base.mightalias are neither in the documentation, nor are they exported. However, the docstring of dataids state:

Custom arrays that would like to opt-in to aliasing detection of their component parts can specialize this method to return the concatenation of the dataids of their component parts

Custom arrays, of course, should not extend internal Base methods that are subject to change or deletion. So either the encouragement to extend dataids should be removed, or dataids and mightalias should be documented.

However, see #50820 : I believe the current implementation of mightalias is a huge footgun. So if it's not possible to actually make this work reliably, maybe it's better to have this be explicitly marked internal, to avoid confusion.

@brenhinkeller brenhinkeller added the feature Indicates new feature / enhancement requests label Oct 20, 2023
@jishnub
Copy link
Contributor

jishnub commented Dec 7, 2023

Another one to consider is Base.unaliascopy, which is now recommended in an error message on nightly. E.g.:

ArgumentError: an array of type `MyLazyArray` shares memory with another argument
  and must make a preventative copy of itself in order to maintain consistent semantics,
  but `copy(::MyLazyArray{Float64, 1})` returns a new array of type `Vector{Float64}`.
  To fix, implement:
      `Base.unaliascopy(A::MyLazyArray)::typeof(A)`

@nsajko
Copy link
Contributor

nsajko commented May 4, 2024

The fact that dataids isn't public and thus mustn't be overloaded by packages (without pinning specific Julia versions in Project.toml) causes over 2x slowdowns in the following example:

julia> using Chairmarks

julia> typeof(x)
FixedSizeVector{FixedSizeVector{FixedSizeVector{FixedSizeVector{FixedSizeVector{Float64, Memory{Float64}}, Memory{FixedSizeVector{Float64, Memory{Float64}}}}, Memory{FixedSizeVector{FixedSizeVector{Float64, Memory{Float64}}, Memory{FixedSizeVector{Float64, Memory{Float64}}}}}}, Memory{FixedSizeVector{FixedSizeVector{FixedSizeVector{Float64, Memory{Float64}}, Memory{FixedSizeVector{Float64, Memory{Float64}}}}, Memory{FixedSizeVector{FixedSizeVector{Float64, Memory{Float64}}, Memory{FixedSizeVector{Float64, Memory{Float64}}}}}}}}, Memory{FixedSizeVector{FixedSizeVector{FixedSizeVector{FixedSizeVector{Float64, Memory{Float64}}, Memory{FixedSizeVector{Float64, Memory{Float64}}}}, Memory{FixedSizeVector{FixedSizeVector{Float64, Memory{Float64}}, Memory{FixedSizeVector{Float64, Memory{Float64}}}}}}, Memory{FixedSizeVector{FixedSizeVector{FixedSizeVector{Float64, Memory{Float64}}, Memory{FixedSizeVector{Float64, Memory{Float64}}}}, Memory{FixedSizeVector{FixedSizeVector{Float64, Memory{Float64}}, Memory{FixedSizeVector{Float64, Memory{Float64}}}}}}}}}} (alias for FixedSizeArray{FixedSizeArray{FixedSizeArray{FixedSizeArray{FixedSizeArray{Float64, 1, GenericMemory{:not_atomic, Float64, Core.AddrSpace{Core}(0x00)}}, 1, GenericMemory{:not_atomic, FixedSizeArray{Float64, 1, GenericMemory{:not_atomic, Float64, Core.AddrSpace{Core}(0x00)}}, Core.AddrSpace{Core}(0x00)}}, 1, GenericMemory{:not_atomic, FixedSizeArray{FixedSizeArray{Float64, 1, GenericMemory{:not_atomic, Float64, Core.AddrSpace{Core}(0x00)}}, 1, GenericMemory{:not_atomic, FixedSizeArray{Float64, 1, GenericMemory{:not_atomic, Float64, Core.AddrSpace{Core}(0x00)}}, Core.AddrSpace{Core}(0x00)}}, Core.AddrSpace{Core}(0x00)}}, 1, GenericMemory{:not_atomic, FixedSizeArray{FixedSizeArray{FixedSizeArray{Float64, 1, GenericMemory{:not_atomic, Float64, Core.AddrSpace{Core}(0x00)}}, 1, GenericMemory{:not_atomic, FixedSizeArray{Float64, 1, GenericMemory{:not_atomic, Float64, Core.AddrSpace{Core}(0x00)}}, Core.AddrSpace{Core}(0x00)}}, 1, GenericMemory{:not_atomic, FixedSizeArray{FixedSizeArray{Float64, 1, GenericMemory{:not_atomic, Float64, Core.AddrSpace{Core}(0x00)}}, 1, GenericMemory{:not_atomic, FixedSizeArray{Float64, 1, GenericMemory{:not_atomic, Float64, Core.AddrSpace{Core}(0x00)}}, Core.AddrSpace{Core}(0x00)}}, Core.AddrSpace{Core}(0x00)}}, Core.AddrSpace{Core}(0x00)}}, 1, GenericMemory{:not_atomic, FixedSizeArray{FixedSizeArray{FixedSizeArray{FixedSizeArray{Float64, 1, GenericMemory{:not_atomic, Float64, Core.AddrSpace{Core}(0x00)}}, 1, GenericMemory{:not_atomic, FixedSizeArray{Float64, 1, GenericMemory{:not_atomic, Float64, Core.AddrSpace{Core}(0x00)}}, Core.AddrSpace{Core}(0x00)}}, 1, GenericMemory{:not_atomic, FixedSizeArray{FixedSizeArray{Float64, 1, GenericMemory{:not_atomic, Float64, Core.AddrSpace{Core}(0x00)}}, 1, GenericMemory{:not_atomic, FixedSizeArray{Float64, 1, GenericMemory{:not_atomic, Float64, Core.AddrSpace{Core}(0x00)}}, Core.AddrSpace{Core}(0x00)}}, Core.AddrSpace{Core}(0x00)}}, 1, GenericMemory{:not_atomic, FixedSizeArray{FixedSizeArray{FixedSizeArray{Float64, 1, GenericMemory{:not_atomic, Float64, Core.AddrSpace{Core}(0x00)}}, 1, GenericMemory{:not_atomic, FixedSizeArray{Float64, 1, GenericMemory{:not_atomic, Float64, Core.AddrSpace{Core}(0x00)}}, Core.AddrSpace{Core}(0x00)}}, 1, GenericMemory{:not_atomic, FixedSizeArray{FixedSizeArray{Float64, 1, GenericMemory{:not_atomic, Float64, Core.AddrSpace{Core}(0x00)}}, 1, GenericMemory{:not_atomic, FixedSizeArray{Float64, 1, GenericMemory{:not_atomic, Float64, Core.AddrSpace{Core}(0x00)}}, Core.AddrSpace{Core}(0x00)}}, Core.AddrSpace{Core}(0x00)}}, Core.AddrSpace{Core}(0x00)}}, Core.AddrSpace{Core}(0x00)}})

julia> typeof(y)
Vector{Vector{Vector{Vector{Vector{Float64}}}}} (alias for Array{Array{Array{Array{Array{Float64, 1}, 1}, 1}, 1}, 1})

julia> @b x g
4.808 s (22222000 allocs: 2.483 GiB, 22.44% gc time, without a warmup)

julia> @b y g
2.357 s (44444000 allocs: 2.980 GiB, 25.67% gc time, without a warmup)

julia> Base.dataids(a::FixedSizeArray) = Base.dataids(a.mem)

julia> @b x g
2.094 s (22222000 allocs: 2.483 GiB, 22.26% gc time, without a warmup)

julia> versioninfo()
Julia Version 1.12.0-DEV.460
Commit 9d59ecc66fd (2024-05-03 17:04 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × AMD Ryzen 3 5300U with Radeon Graphics
  WORD_SIZE: 64
  LLVM: libLLVM-17.0.6 (ORCJIT, znver2)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)

Interpretation: without overloading dataids, g is more than two times slower for FixedSizeVector than for Vector, but when dataids gets overloaded g becomes faster for FixedSizeVector than for Vector. Profiling shows that, before overloading, dataids for FixedSizeVector spends almost all of its time in objectid.

FixedSizeVector is from the experimental package FixedSizeArrays.jl, @giordano, xref JuliaArrays/FixedSizeArrays.jl#53.

The function g is implemented like so:

f(a, b) = a + b
f(a) = f(a, a)
function g(a, n = 2000)
  T = typeof(a)::Type
  for _  Base.OneTo(n)
    a = f(a)::T
  end
  a
end

@nsajko nsajko added performance Must go faster design Design of APIs or of the language itself labels May 4, 2024
@nsajko nsajko added the broadcast Applying a function over a collection label May 5, 2024
@nsajko
Copy link
Contributor

nsajko commented Sep 9, 2024

FTR this is kinda my plan for tackling this, @giordano:

  • The mightalias function doesn't seem like it should be public API. Defining a mightalias method for each pair of array types doesn't seem viable at all. So there's nothing to do there, I believe.
  • The dataids function, or something like it, should IMO be made public API of Julia. However, I don't think the dataids interface is currently very good, so I plan on improving it, hopefully in a backwards compatible manner, before we try to publicize the aliasing API.
    • Currently dataids returns Tuple{Vararg{UInt}}. The problem with this is that it basically forces each array type to pretend its data lives in the same "address space" as with all other array types. This is fine for most array types which just wrap Memory (directly or indirectly), but it seems like a bad design in general. For example, one could very well design a family of array types, so that none of these array types rely on Memory, but still with the possibility of aliasing. For both performance and correctness, such a family of types would require it's dataids to be something else than UInt, so the dataids could be distinguished from the Memory dataids. Thus I've been thinking a long time about the possible designs for suitable heterogeneous collection types of compile-time constant size (Tuple is perhaps OK, but not a perfect choice). Such a heterogeneous collection doesn't necessarily need an ordering, but the isdisjoint operation needs to be efficient, even for long and very heterogeneous instances. I think I have this figured out by now.
    • Another potential issue with the current design: it assumes that any array value may be represented as a collection of dataids, so it checks for aliasing between two distinct array values by checking if their associated dataids collections (as returned by dataids) are disjoint. However, I'm not completely sure that a better design couldn't exist. Could we represent the aliasing situation for an array value more precisely than by just using a collection of dataids? Perhaps instead of comparing the dataids for equality, we should really be calling isdisjoint on the individual dataids, too, recursively?

@nsajko
Copy link
Contributor

nsajko commented Sep 9, 2024

Another issue with dataids is this:

julia> struct S end

julia> Base.dataids(S())
()

xref #50820

Basically, dataids has a fallback that returns an empty collection. Thus arrays of unknown type are assumed not to alias anything. Instead there should simply be a function like implements_dataids(::YourArrayType)::Bool.

@mbauman
Copy link
Member

mbauman commented Sep 9, 2024

The problem with this is that it basically forces each array type to pretend its data lives in the same "address space" as with all other array types.

Of all the problems with dataids, I don't think this is one worth consternating over. We could choose to hash Memory's pointers to intentionally destroy any structure they have, I suppose, but either way we're looking at — effectively — a hash collision here. And the collision case is a speed bump.

Perhaps instead of comparing the dataids for equality, we should really be calling isdisjoint on the individual dataids, too, recursively?

One of the most important design criteria here is to be fast in the no-copy-case, but this could definitely be a good idea. Other aliasing detection systems often have an "effort" sort of argument that allow you to do a quickmightalias(A, B) ? (slowmightalias(A, B) ? copy(A) : A) : A — numpy has some good inspiration here. It's worth noting that one of my design drafts here had dataids be a tuple of UnitRanges that expressed the memory extents, but I found that to be more complicated than it's worth (and would have really exacerbated your first point above).

Basically, dataids has a fallback that returns an empty collection

This is — IMO — definitely the biggest flaw with the current system, borne out of the "incremental improvement" where we weren't doing aliasing checks at all. See #26237 (comment) for my latest attempt at improving this.


One other small consideration that seems like it could have an outsized effect is that we could have a distinction between "read-only" dataids and "writeable" dataids. Most commonly this appears with SubArrays. Writing to a SubArray will not affect its indices but unalias doesn't know about that... and so you can easily get spurious copies in the fairly common case where two indices are the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrays [a, r, r, a, y, s] broadcast Applying a function over a collection design Design of APIs or of the language itself feature Indicates new feature / enhancement requests performance Must go faster
Projects
None yet
Development

No branches or pull requests

5 participants