-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Customizable lazy fused broadcasting in pure Julia #26891
Conversation
This patch represents the combined efforts of four individuals, over 60 commits, and an iterated design over (at least) three pull requests that spanned nearly an entire year (closes #22063, #23692, #25377 by superceding them). This introduces a pure Julia data structure that represents a fused broadcast expression. For example, the expression `2 .* (x .+ 1)` lowers to: ```julia julia> Meta.@lower 2 .* (x .+ 1) :($(Expr(:thunk, CodeInfo(:(begin Core.SSAValue(0) = (Base.getproperty)(Base.Broadcast, :materialize) Core.SSAValue(1) = (Base.getproperty)(Base.Broadcast, :make) Core.SSAValue(2) = (Base.getproperty)(Base.Broadcast, :make) Core.SSAValue(3) = (Core.SSAValue(2))(+, x, 1) Core.SSAValue(4) = (Core.SSAValue(1))(*, 2, Core.SSAValue(3)) Core.SSAValue(5) = (Core.SSAValue(0))(Core.SSAValue(4)) return Core.SSAValue(5) end))))) ``` Or, slightly more readably as: ```julia using .Broadcast: materialize, make materialize(make(*, 2, make(+, x, 1))) ``` The `Broadcast.make` function serves two purposes. Its primary purpose is to construct the `Broadcast.Broadcasted` objects that hold onto the function, the tuple of arguments (potentially including nested `Broadcasted` arguments), and sometimes a set of `axes` to include knowledge of the outer shape. The secondary purpose, however, is to allow an "out" for objects that _don't_ want to participate in fusion. For example, if `x` is a range in the above `2 .* (x .+ 1)` expression, it needn't allocate an array and operate elementwise — it can just compute and return a new range. Thus custom structures are able to specialize `Broadcast.make(f, args...)` just as they'd specialize on `f` normally to return an immediate result. `Broadcast.materialize` is identity for everything _except_ `Broadcasted` objects for which it allocates an appropriate result and computes the broadcast. It does two things: it `initialize`s the outermost `Broadcasted` object to compute its axes and then `copy`s it. Similarly, an in-place fused broadcast like `y .= 2 .* (x .+ 1)` uses the exact same expression tree to compute the right-hand side of the expression as above, and then uses `materialize!(y, make(*, 2, make(+, x, 1)))` to `instantiate` the `Broadcasted` expression tree and then `copyto!` it into the given destination. All-together, this forms a complete API for custom types to extend and customize the behavior of broadcast (fixes #22060). It uses the existing `BroadcastStyle`s throughout to simplify dispatch on many arguments: * Custom types can opt-out of broadcast fusion by specializing `Broadcast.make(f, args...)` or `Broadcast.make(::BroadcastStyle, f, args...)`. * The `Broadcasted` object computes and stores the type of the combined `BroadcastStyle` of its arguments as its first type parameter, allowing for easy dispatch and specialization. * Custom Broadcast storage is still allocated via `broadcast_similar`, however instead of passing just a function as a first argument, the entire `Broadcasted` object is passed as a final argument. This potentially allows for much more runtime specialization dependent upon the exact expression given. * Custom broadcast implmentations for a `CustomStyle` are defined by specializing `copy(bc::Broadcasted{CustomStyle})` or `copyto!(dest::AbstractArray, bc::Broadcasted{CustomStyle})`. * Fallback broadcast specializations for a given output object of type `Dest` (for the `DefaultArrayStyle` or another such style that hasn't implemented assignments into such an object) are defined by specializing `copyto(dest::Dest, bc::Broadcasted{Nothing})`. As it fully supports range broadcasting, this now deprecates `(1:5) + 2` to `.+`, just as had been done for all `AbstractArray`s in general. As a first-mover proof of concept, LinearAlgebra uses this new system to improve broadcasting over structured arrays. Before, broadcasting over a structured matrix would result in a sparse array. Now, broadcasting over a structured matrix will _either_ return an appropriately structured matrix _or_ a dense array. This does incur a type instability (in the form of a discriminated union) in some situations, but thanks to type-based introspection of the `Broadcasted` wrapper commonly used functions can be special cased to be type stable. For example: ```julia julia> f(d) = round.(Int, d) f (generic function with 1 method) julia> @inferred f(Diagonal(rand(3))) 3×3 Diagonal{Int64,Array{Int64,1}}: 0 ⋅ ⋅ ⋅ 0 ⋅ ⋅ ⋅ 1 julia> @inferred Diagonal(rand(3)) .* 3 ERROR: return type Diagonal{Float64,Array{Float64,1}} does not match inferred return type Union{Array{Float64,2}, Diagonal{Float64,Array{Float64,1}}} Stacktrace: [1] error(::String) at ./error.jl:33 [2] top-level scope julia> @inferred Diagonal(1:4) .+ Bidiagonal(rand(4), rand(3), 'U') .* Tridiagonal(1:3, 1:4, 1:3) 4×4 Tridiagonal{Float64,Array{Float64,1}}: 1.30771 0.838589 ⋅ ⋅ 0.0 3.89109 0.0459757 ⋅ ⋅ 0.0 4.48033 2.51508 ⋅ ⋅ 0.0 6.23739 ``` In addition to the issues referenced above, it fixes: * Fixes #19313, #22053, #23445, and #24586: Literals are no longer treated specially in a fused broadcast; they're just arguments in a `Broadcasted` object like everything else. * Fixes #21094: Since broadcasting is now represented by a pure Julia datastructure it can be created within `@generated` functions and serialized. * Fixes #26097: The fallback destination-array specialization method of `copyto!` is specifically implemented as `Broadcasted{Nothing}` and will not be confused by `nothing` arguments. * Fixes the broadcast-specific element of #25499: The default base broadcast implementation no longer depends upon `Base._return_type` to allocate its array (except in the empty or concretely-type cases). Note that the sparse implementation (#19595) is still dependent upon inference and is _not_ fixed. * Fixes #25340: Functions are treated like normal values just like arguments and only evaluated once. * Fixes #22255, and is performant with 12+ fused broadcasts. Okay, that one was fixed on master already, but this fixes it now, too. * Fixes #25521. * The performance of this patch has been thoroughly tested through its iterative development process in #25377. There remain [two classes of performance regressions](#25377) that Nanosoldier flagged. * #25691: Propagation of constant literals sill lose their constant-ness upon going through the broadcast machinery. I believe quite a large number of functions would need to be marked as `@pure` to support this -- including functions that are intended to be specialized. (For bookkeeping, this is the squashed version of the [teh-jn/lazydotfuse](#25377) branch as of a1d4e7e. Squashed and separated out to make it easier to review and commit) Co-authored-by: Tim Holy <tim.holy@gmail.com> Co-authored-by: Jameson Nash <vtjnash@gmail.com> Co-authored-by: Andrew Keller <ajkeller34@users.noreply.github.com>
Yay!
Won’t these two signatures lead to confusion if someone tries to broadcast a BroadcastStyle as the first (function) argument |
Possibly helped by #26826, but we can likely work towards modifying our heuristics, as needed, to permit analysis of these cases |
This may be a record on closing issues with one PR! (Possibly only beat out by Jeff's type system revamp or adding the package manager, mostly because I just closed all package manager-related issues after that 😹.) My only nit to pick is that I do not love the name |
Yes, but I don't find this all too terrible as
That's fair. The whole point is to allow folks to override the |
👍 Lowercase |
alternately, |
I like using the lowercased name of the type here. I'm open to other ideas, but I think we should think of the names |
NEWS.md
Outdated
@@ -418,6 +413,14 @@ This section lists changes that do not have deprecation warnings. | |||
* `findn(x::AbstractArray)` has been deprecated in favor of `findall(!iszero, x)`, which | |||
now returns cartesian indices for multidimensional arrays (see below, [#25532]). | |||
|
|||
* Broadcasting operations are no longer fused into a single operation by Julia's parser. | |||
Instead, a lazy `Broadcasted` wrapper is created, and the parser will call | |||
`copy(bc::Broadcasted)` or `copyto!(dest, bc::Broadcasted)` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like the lowering actually calls Broadcast.materialize
or Broadcast.materialize!
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, yes, that's true. I still want to highlight copy
and copyto!
as those are the methods to specialize. Let's just remove that "parser will call" to make it technically correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why the parser doesn't lower to these copy
calls directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. There are several reasons here, and they're different for in-place (broadcast!
/ .=
) vs. out-of-place broadcasting.
For in-place broadcasts, we use materialize!
for two purposes: (1) the outermost broadcasted
might not return a Broadcasted
object, but we need to support broadcasting into the destination (e.g., copyto!(rand(3,3), 1:3)
doesn't broadcast) and we want to use the same code-paths here. It solves both problems at once by wrapping the non-Broadcasted
object x
in a Broadcasted(identity, (x,))
. (2) The Broadcasted
argument wants to know about its axes, but those are dependent upon the axes of the destination. It explicitly sets the axes of the Broadcasted
object to those of the destination, and then those are checked to ensure they are a compatible shape with instantiate
.
For out-of-place broadcasts, materialize
also serves two purposes: (1) for Broadcasted
arguments, it computes and caches the axes of the outermost Broadcasted
, recursively checking the axes of its arguments by instantiate
-ing them. Now, we may be able to get away with not storing the axes
since axes
should theoretically only be called once and it can be computed on demand, but I haven't tested it and it seems strange to throw a DimensionError
from a call to axes
. (2) for all other arguments it defaults to identity
but can serve as a hook for custom types to do similar sorts of specializations decoupled from copy
.
I'm also aiming for some semblance of symmetry between the two cases.
Now there are two is one caveat here: (1) A wart that I really dislike is (Fixed in 8a2d88a — this hack is no longer necessary). (2) The expression broadcast
itself is hard-coded to call copy(instantiate(broadcasted(…)))
instead of materialize
. For reasons beyond my comprehension, materialize
sometimes failed to inline even though it is explicitly marked as @inline
.A .= x
still lowers to broadcast!(identity, A, x)
. If otherwise unspecialized, it will fall back to the default implementation, which constructs the appropriate Broadcasted
object and then goes through the above machinery as expected. We may want to change this purely to be able to simplify our description of this parser transformation.
Yes, we see that capitalized/lowercase idiom elsewhere, and I think it makes sense here. See e.g. |
Even so, it leads to some weird side-effects, such as: |
For reasons beyond my comprehensions, this previously failed to inline despite the at-inline. For reasons that are also beyond my comprehension, this hack is no longer necessary.
Nifty feature. Seriously, though, do you have any ideas on a replacement API here that is similarly easy to specialize? I simply don't find this all that troublesome. |
by implementing a preprocessing step for chunked bitarray broadcast that converts boolean-only functions to ones that support chunk-wise operation. Also add a number of tests for this codepath
I've edited the first post to reflect the renaming of |
Ok, one last question from the now-hidden code comment: what should
With no specializations beyond those in base, all are equivalent. I would prefer 1 or 3, but don't have terribly strong feelings beyond that. |
Unrelated, but building on the capitalized/lowercase idiom, it might be worthwhile considering renaming Generator to Map. |
Option 3 seems like it gives the most flexibility, right? |
Now instead of lowering to `broadcast!(identity,A,x)`, it lowers to `Broadcast.materialize!(A, Broadcast.broadcasted(identity, x))`, with all names resolved in the top module.
Ok, yes, I do like this better. It reduces the number of names that lowering calls and should be simpler to document. It's still a special case — you still need to explain where that |
How serious is your objection to the |
Actually, I'm coming around to agree. Let's merge! :) |
Travis failure was #26902, AppVeyor x86_64 was a timeout. |
🎉 |
Currently doesn't work due to the new changes from JuliaLang/julia#26891
Currently doesn't work due to the new changes from JuliaLang/julia#26891
Currently doesn't work due to the new changes from JuliaLang/julia#26891
I have not had any time for general development lately, and that's likely to continue for a little while yet. So just wanted to give a big shout-out to @mbauman and anyone else who pushed this over the finish line. Many thanks for seeing it through! |
Currently doesn't work due to the new changes from JuliaLang/julia#26891
This patch represents the combined efforts of four individuals, over 60 commits, and an iterated design over (at least) three pull requests that spanned nearly an entire year (closes #22063, closes #23692, closes #25377 by superseding them).
What does it do?
This introduces a pure Julia data structure that represents a fused broadcast
expression. For example, the expression
2 .* (x .+ 1)
lowers to:Or, slightly more readably as:
The
broadcasted
function serves two purposes. Its primary purpose is toconstruct the
Broadcast.Broadcasted
objects that hold onto the function, thetuple of arguments (potentially including nested
Broadcasted
arguments), andsometimes a set of
axes
to include knowledge of the outer shape. Thesecondary purpose, however, is to allow an "out" for objects that don't want
to participate in fusion. For example, if
x
is a range in the above2 .* (x .+ 1)
expression, it needn't allocate an array and operate elementwise — itcan just compute and return a new range. Thus custom structures are able to
specialize
Broadcast.broadcasted(f, args...)
just as they'd specialize onf
normally to return an immediate result.
Broadcast.materialize
is identity for everything exceptBroadcasted
objects for which it allocates an appropriate result and computes the
broadcast. It does two things: it
initialize
s the outermostBroadcasted
object to compute its axes and then
copy
s it.Similarly, an in-place fused broadcast like
y .= 2 .* (x .+ 1)
uses the exactsame expression tree to compute the right-hand side of the expression as above,
and then uses
materialize!(y, broadcasted(*, 2, broadcasted(+, x, 1)))
toinstantiate
theBroadcasted
expression tree and thencopyto!
it into the given destination.All-together, this forms a complete API for custom types to extend and
customize the behavior of broadcast (fixes #22060). It uses the existing
BroadcastStyle
s throughout to simplify dispatch on many arguments:Custom types can opt-out of broadcast fusion by specializing
Broadcast.broadcasted(f, args...)
orBroadcast.broadcasted(::BroadcastStyle, f, args...)
.The
Broadcasted
object computes and stores the type of the combinedBroadcastStyle
of its arguments as its first type parameter, allowing foreasy dispatch and specialization.
Custom Broadcast storage is still allocated via
broadcast_similar
, howeverinstead of passing just a function as a first argument, the entire
Broadcasted
object is passed as a final argument. This potentially allowsfor much more runtime specialization dependent upon the exact expression
given.
Custom broadcast implmentations for a
CustomStyle
are defined byspecializing
copy(bc::Broadcasted{CustomStyle})
orcopyto!(dest::AbstractArray, bc::Broadcasted{CustomStyle})
.Fallback broadcast specializations for a given output object of type
Dest
(for the
DefaultArrayStyle
or another such style that hasn't implementedassignments into such an object) are defined by specializing
copyto(dest::Dest, bc::Broadcasted{Nothing})
.How do base and the standard libraries take advantage?
As it fully supports range broadcasting, this now deprecates
(1:5) + 2
to.+
, just as had been done for allAbstractArray
s in general.BitArray is able to exploit this new structure to great effect. When broadcasting
into a BitArray, it is able to scan through the
Broadcasted
expression tree tosee if it'd be able to support fusing at the level of chunks, operating on 64 bits
at a time. This is a huge performance gain on the order of 100x or more. The
simple
@benchmark $A .& $B .| .!$C
demonstrates a 130x performance gainover master for
A, B, C = bitrand.((1000,1000,1000))
(50ns vs 6500ns). Thenumbers are even greater for larger structures.
As a first-mover proof of concept, LinearAlgebra uses this new system to
improve broadcasting over structured arrays. Before, broadcasting over a
structured matrix would result in a sparse array. Now, broadcasting over a
structured matrix will either return an appropriately structured matrix or
a dense array. This does incur a type instability (in the form of a
discriminated union) in some situations, but thanks to type-based introspection
of the
Broadcasted
wrapper commonly used functions can be special cased to betype stable. For example:
Sparse arrays already had fairly extensive broadcasting support, but it was all based around broadcasting a single function over a set of arguments. This PR implements broadcasting as an expression tree of potentially many nested functions and arguments, but to support this common case, a
Broadcast.flatten
function is provided to convert the expression tree into a single function and set of arguments. This allows custom implementations a way to opt into a simpler API in their implementation.What more does it fix?
In addition to the issues referenced above, it fixes:
Fixes Element-wise call syntax does not always lower to standard broadcast() #19313, fixes x.*x not customizable #22053, fixes Fix docs explaining broadcasting, or change behaviour of
min.(1,f)
#23445, and fixes UnitRange broacasting with literals gives a Vector #24586: Literals are no longer treatedspecially in a fused broadcast; they're just arguments in a
Broadcasted
object like everything else.
Fixes
generated function body is not pure
when using dot notation #21094: Since broadcasting is now represented by a pure Juliadatastructure it can be created within
@generated
functions and serialized.Fixes Bug in broadcast! involving nothing #26097: The fallback destination-array specialization method of
copyto!
is specifically implemented asBroadcasted{Nothing}
and will notbe confused by
nothing
arguments.Fixes the broadcast-specific element of Base._return_type usages wrong for Type return #25499: The default base broadcast
implementation no longer depends upon
Base._return_type
to allocate itsarray (except in the empty or concretely-type cases). Note that the sparse
implementation (#19595) is still dependent upon inference and is not fixed.
Fixes nested broadcast should evaluate non-broadcasted part once #25340: Functions are treated like normal values just like arguments
and only evaluated once.
Fixes Performance issues with fusing 12+ broadcasts #22255, and is performant with 12+ fused broadcasts. Okay, that one was
fixed on master already, but this fixes it now, too.
Fixes StepRange + Float gives incorrect length #25521.
What's not to love?
The performance of this patch has been thoroughly tested through its
iterative development process in RFC: Customizable lazy broadcasting with options for pure-Julia fusion and eager evaluation #25377. There remain two classes of
performance regressions that Nanosoldier flagged.
getfield broadcast on empty collection doesn't infer #25691: Propagation of constant literals sill lose their constant-ness upon
going through the broadcast machinery. I believe quite a large number of
functions would need to be marked as
@pure
to support this -- includingfunctions that are intended to be specialized.
(For bookkeeping, this is the squashed version of the teh-jn/lazydotfuse
branch as of a1d4e7e. Squashed and separated
out to make it easier to review and commit)
Co-authored-by: @timholy
Co-authored-by: @vtjnash
Co-authored-by: @ajkeller34