
RFC: Customizable lazy broadcasting with options for pure-Julia fusion and eager evaluation #25377

Closed
wants to merge 62 commits into master from teh-jn/lazydotfuse

Conversation

timholy
Member

timholy commented Jan 3, 2018

This PR (WIP) implements broadcasting using a lazy wrapper. It's a collaborative effort with @vtjnash, following up on the branch described here. It's a WIP because some of the sparse/higherorderfns tests are allocating and/or hitting inference limits, and there may be other bugs I haven't yet addressed.

Having gotten most of the way through this, I would characterize this as an attractive but dangerous change. It fixes numerous problems but also risks introducing a whole bunch of fun new bugs. For a few people it might make broadcasting harder to support than it is now, but for others it will finally make broadcasting usable.

The overall strategy is to create Broadcasted wrappers, and then copy them (or copyto! for broadcast!) to eagerly evaluate. There's also an intermediate step of instantiation, which just finishes various elements of the wrapper (the indexing information). Certain packages like StaticArrays.jl won't want to use the indexing operations, so this allows them to avoid the performance hit that comes from instantiate.

The trickiest part of all this is that Broadcasted wrappers can be nested, for example in 2 .* (x .+ 1), which leads to Broadcasted(*, 2, Broadcasted(+, x, 1)). In general, what this means is that users may have to walk the tree to implement broadcasting, which may not be a simple operation. To simplify this, I provide two approaches: (1) Broadcasted containers can be flattened, using pure-Julia code to perform the same fusion operations we used to do with in julia-syntax.scm (this is used for implementing sparse broadcasting); (2) there's support for straightforward "incremental" (eager) evaluation, starting with the deepest layers and working your way up (this is used for supporting broadcasting on AbstractRanges). Hopefully some combination of these two will ease most folks' pain.
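
To make approach (1) concrete, here is roughly what flattening buys you, using the `make` constructor shown later in this thread (a sketch, not output copied from the branch: the exact `flatten` signature and the printing of the flattened arguments may differ):

```julia
julia> using Base.Broadcast: make, flatten

julia> x = [1, 2, 3];

julia> bc = make(*, 2, make(+, x, 1))    # the nested tree for 2 .* (x .+ 1)
Broadcasted(*, 2, Broadcasted(+, [1, 2, 3], 1))

julia> flat = flatten(bc);               # one flat node: a fused function over the leaf arguments

julia> flat.args                         # the leaves, in order
{2, [1, 2, 3], 1}

julia> flat.f(2, 10, 1)                  # the fused function, equivalent to 2 * (10 + 1)
22
```

A package implementing `copy(bc::Broadcasted{MyStyle})` for its own style can therefore call `flatten` first and only handle the non-nested case.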

One immediately practical consequence is that certain operations are much harder on inference, since more operations are handled by pure-Julia code. While I haven't timed carefully, I suspect slowdowns in most or all of the following tests:

  • linalg/dense
  • sparse/sparsevector
  • sparse/higherorderfns
  • linalg/triangular

Achievements:

@timholy added the `needs docs` (Documentation for this change is required) label Jan 3, 2018
@mbauman added the `broadcast` and `needs news` (A NEWS entry is required for this change) labels Jan 3, 2018
@timholy
Member Author

timholy commented Jan 3, 2018

It's worth pointing out an interesting feature of this approach, since it addresses a concern about "poisoning". Consider x = 1:5 and the operation 2 .+ (x .+ (x .+ 1)) .- 1. Because it only involves numbers and ranges, each step of this will be eagerly evaluated and it will return the value 4:2:12, i.e., still a range.

Now change this to 2 .+ (x .* (x .+ 1)) .- 1. It will eagerly evaluate until it gets to the .*, at which point it will realize that the result can't be captured by an AbstractRange, and it will store the intermediate result as a Vector{Int}. From that point on, the remaining calculation is fused. So in a sense this circumvents the "there is no middle ground" problem, though it still suffers from allocating a temporary.
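
Concretely (a REPL sketch of the behavior described above, on this branch; exact printing aside):

```julia
julia> x = 1:5;

julia> 2 .+ (x .+ (x .+ 1)) .- 1   # only ranges and numbers: each step is eager, result stays a range
4:2:12

julia> 2 .+ (x .* (x .+ 1)) .- 1   # the .* step densifies to a Vector{Int}; the rest is fused
5-element Array{Int64,1}:
  3
  7
 13
 21
 31
```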

@andyferris
Member

and it will store the intermediate result as a Vector{Int}

Couldn't it store the intermediate result as a Broadcasted instead?

@timholy
Member Author

timholy commented Jan 4, 2018

Once you've decided to eagerly evaluate, the problem is that (at least with the current implementation) storing as a Broadcasted will put you into a non-terminating recursion. But hmm, maybe one could construct it as a Broadcasted{VectorStyle}; that seems likely to circumvent the problem and avoid the creation of the temporary. I can play with that.

The other strategy is to figure out at the beginning that it won't return a range. That's a little non-trivial though, because it depends on which operations occur in which order. For example, if r is a UnitRange, then r .- 1 returns a ::UnitRange but 1 .- r returns a StepRange. With deep nesting, it seems like it might be pretty complicated to write something that would figure this out at compile-time (i.e., inferrably). So I took the easy way out and evaluated incrementally.

@timholy
Member Author

timholy commented Jan 4, 2018

I now know more about the test failures. Here are the important ones:

  • core.jl: Hang in subtyping with sufficiently-complex Union #25388
  • numbers.jl: it's the literal_pow problem, replicable with [2, 4, 6].^-2 (which on this branch yields a DomainError)
  • there are many broadcast methods for special array types like Bidiagonal. These have to become copy(::Broadcasted) methods before this can be merged.
  • finally, the allocation and inference-limits failures in sparse/higherorderfns.jl have to be addressed. The allocation is more serious; I'm less worried about the inference limits because it's close now (works for 7 arguments but not 8). Presumably there's a single parameter that can be tweaked, or we can just live with 7 arguments.

Because it's not obvious this is going to be merged for 1.0, I will likely move on with the rest of my life (this has taken a lot of time...) and come back to these later, unless I hear there's interest in getting it into 1.0.

@Sacha0
Member

Sacha0 commented Jan 4, 2018

(Triage seemed generally favorable pending further review.

Edit: To expand, triage wasn't terribly worried about e.g. the minor impact on sparse broadcast and presently <30-50% performance regressions mentioned elsewhere. Triage seemed uniformly enthused by the prospect of removing the literal slurping in broadcast lowering, and similarly interested in the other benefits this work might provide (e.g. enabling better accelerator support).)

@andyferris
Member

I'd love this for v1.0, assuming it is feasible to do so.

@inline broadcast!(f::Tf, dest, ::BroadcastStyle, As::Vararg{Any,N}) where {Tf,N} = broadcast!(f, dest, nothing, As...)

# Default behavior (separated out so that it can be called by users who want to extend broadcast!).
@inline function broadcast!(f, dest, ::Nothing, As::Vararg{Any, N}) where N
Member

Hooray for removing this method. I was just about to open this issue:

julia> X = Any[1,2]
2-element Array{Any,1}:
 1
 2

julia> X .= missing
2-element Array{Any,1}:
 missing
 missing

julia> X .= nothing
ERROR: MethodError: no method matching _broadcast!(::typeof(identity), ::Array{Any,1})
Closest candidates are:
  _broadcast!(::Any, ::AbstractArray, ::K, ::ID, ::AT, ::BT, ::Val{N}, ::Any) where {K, ID, AT, BT, N} at broadcast.jl:377
  _broadcast!(::Any, ::AbstractArray, ::K, ::ID, ::AT, ::Val{nargs}, ::Any, ::Any, ::Any) where {K, ID, AT, nargs} at broadcast.jl:489
  _broadcast!(::Any, ::Any, ::Any, ::Any...) where N at broadcast.jl:477
  ...
Stacktrace:
 [1] broadcast!(::Function, ::Array{Any,1}, ::Nothing) at ./broadcast.jl:455
 [2] top-level scope

but it looks like your branch will fix it (I've not tested to confirm, though). I can still open that issue if you want a bit more motivation here… :)

Member Author

Great test, thanks! It wasn't fully fixed here, so this was useful.

@SimonDanisch
Contributor

I want to make my broadcast implementation in GPUArrays compatible with 0.7.
Will the API change a lot, or can I easily make things compatible with master now and expect that supporting this branch's changes once it's merged won't take much work?

If the API will change a lot, would it make sense to base my refactor on this branch, or is there only a small chance that this will get merged any time soon?

timholy and others added 8 commits January 7, 2018 15:04
The main one left is in concatenation, in a line

    inds[i] = offsets[i] .+ cat_indices(x, i)
If you've already said `using Test`, defining a function named `Test` causes problems.
This is consistent with the deprecation of methods like `[1,2,3] + 1`.
Among other things, this supports returning AbstractRanges for appropriate inputs.

Fixes #21094, fixes #22053
@timholy removed the `needs docs` and `needs news` labels Jan 7, 2018
@timholy
Member Author

timholy commented Jan 7, 2018

~~Executive~~ Busy developer summary

OK, except for Julia bugs I think this is basically finished: it has docs and NEWS, and test results all "make sense" (see subsection below). The docs in particular should make reviewing this easier. I'll flag two issues that merit particular discussion (see subsections below).

I may be able to work on the subtyping/inference bugs, but given how much I've sunk into this already I'll wait for at least one substantive review to find out if any of this is worth it 😄.

Finally, I tried to reorder the commits to separate out functionality, but the date stamps messed up the ordering as displayed by GitHub. Three or four of the commits are perhaps of value even if we don't merge the main work here; it's probably pretty obvious which ones these are.

Preventing circular evaluation (StackOverflowErrors)

As described above, there is an "incremental" evaluation path. In its current state, failure to provide one of the necessary broadcast methods will result in circular evaluation and a StackOverflowError. It's presumably fixable if we require users to specialize this syntax:

copy(bc::Broadcasted{Style}, from_incremental::Bool)

rather than the single-argument method copy(bc). (Conveniently, the general technique was illustrated just today in a discourse post.) That has some obvious negatives: a 2-arg copy is weird, and even types that don't use incremental evaluation are still required to implement the 2-arg version.

Consider allowing a non-inferrable return type from f.(::StructuredMatrix)

If you define StructuredMatrix = Union{Diagonal,Bidiagonal,SymTridiagonal,Tridiagonal}, then currently on this branch f.(::T) where T<:StructuredMatrix infers as a Union{T,SparseMatrixCSC}. The reason is that there's a test for whether f(0) == 0; if so, it returns type T, and otherwise a (densified) SparseMatrixCSC. This represents a generalization & replacement of quite a few methods such as

broadcast(::typeof(ceil), D::Diagonal) = Diagonal(...)

We were a bit hit-and-miss about which of these were present (Diagonal did not have as many of these methods as some of the others), and only a few functions were covered. Conversely, the strategy here is exhaustive in covering all functions, at the cost of having the return type inferred as a Union.

Of course I can change this back to always return a SparseMatrixCSC, but I thought I'd see whether folks think that our Union-splitting is good enough that this becomes attractive. If we like this, then perhaps I should change it to return a Union{T,<:Matrix} since the only advantage of returning a SparseMatrixCSC is for inferrability.
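
To illustrate the `f(0) == 0` test being described, a stripped-down version might look like the following (an illustration only, not the branch's code, and it densifies to a `Matrix` rather than a `SparseMatrixCSC`):

```julia
using LinearAlgebra

# Returns Union{Diagonal, Matrix}: keep the structure only when f maps the
# implicit zeros to zero; otherwise densify.
function broadcast_diag(f, D::Diagonal)
    if iszero(f(zero(eltype(D))))
        return Diagonal(f.(D.diag))   # e.g. ceil: ceil(0.0) == 0.0, stays Diagonal
    else
        return f.(Matrix(D))          # e.g. cos: cos(0.0) == 1.0, must go dense
    end
end

broadcast_diag(ceil, Diagonal([1.5, 2.5]))   # Diagonal([2.0, 3.0])
broadcast_diag(cos,  Diagonal([1.5, 2.5]))   # 2×2 dense matrix
```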

Test results

Unless stated otherwise, tests pass locally.

  • test/numbers.jl fails because of the literal_pow issue, and I don't personally have a plan for dealing with that---if we want that test to pass, presumably it will take collaboration with someone who speaks lisp more fluently than I.
  • test/core.jl fails because of Hang in subtyping with sufficiently-complex Union #25388
  • test/sparse/higherorderfns.jl fails because of some inference Heisenbug. All I have to do is force one of the methods to recompile in the running julia session and many more tests (all those with fewer than 8 args) pass. The 8-arg case still fails.
  • that said, when inference gets fixed, it will pass only because I annotated the @allocated(expr) == 0 tests in test/sparse/higherorderfns.jl as @test_broken. Despite adding @inlines willy-nilly I've been unable to get rid of all the allocations. (Maybe I'm missing some.)

@timholy
Member Author

timholy commented Jan 7, 2018

@SimonDanisch, I'd recommend holding off for a few more days if you can, since I think we'll know the ultimate fate of this by then.

the results of ([`Base.BroadcastStyle`](@ref)) applied to the argument types
- execution of `copy(bc::Broadcasted{DestStyle})`, which in simple cases only requires that
you allocate the output with [`Base.broadcast_similar`](@ref). In more complex cases, you may
wish to specialize `copy` and/or `copy!` for `DestStyle`.
Member

copyto!? :)

Also, did you mean to number the stages or so?

@mbauman
Member

mbauman commented Jan 9, 2018

I wasn't quite clear what this ended up looking like, myself, so as a first pass of a review I figured it'd be helpful to others to see what this actually does. For now this is just a scratchpad for my own notes.

High-level end-user summary of fusion

julia> Meta.lower(Main, :(@. 3 + 2x + sin(x)))
:($(Expr(:thunk, CodeInfo(:(begin
      SSAValue(0) = (Base.getproperty)(Base.Broadcast, :execute)
      SSAValue(1) = (Base.getproperty)(Base.Broadcast, :make)
      SSAValue(2) = (Base.getproperty)(Base.Broadcast, :make)
      SSAValue(3) = (SSAValue(2))(*, 2, x)
      SSAValue(4) = (Base.getproperty)(Base.Broadcast, :make)
      SSAValue(5) = (SSAValue(4))(sin, x)
      SSAValue(6) = (SSAValue(1))(+, 3, SSAValue(3), SSAValue(5))
      SSAValue(7) = (SSAValue(0))(SSAValue(6))
      return SSAValue(7)
  end)))))

In other words, @. 3 + 2x + sin(x) lowers to:

julia> using Base.Broadcast: make, execute, instantiate

julia> x = 1:10
1:10

julia> bc = make(+, 3, make(*, 2, x), make(sin, x))
Broadcasted(+, 3, Broadcasted(*, 2, 1:10), Broadcasted(sin, 1:10))

julia> execute(bc)
10-element Array{Float64,1}:
 ⋮

Simple enough. make is basically a Broadcasted constructor (the type that represents the lazy function application), and execute demarcates the end of the laziness. execute is very simple, too — it's just defined as execute(x) = copy(instantiate(x)).

Nitty-gritty details

This belies a goodly amount of complexity, as evidenced by:

julia> typeof(bc)
Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Base.Broadcast.Unknown,Nothing,Nothing,typeof(+),Base.TupleLL{Int64,Base.TupleLL{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Base.Broadcast.Unknown,Nothing,Nothing,typeof(*),Base.TupleLL{Int64,Base.TupleLL{UnitRange{Int64},Base.TupleLLEnd}}},Base.TupleLL{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Base.Broadcast.Unknown,Nothing,Nothing,typeof(sin),Base.TupleLL{UnitRange{Int64},Base.TupleLLEnd}},Base.TupleLLEnd}}}}

Whoa whoa whoa. What's that? Okay, well, Broadcasted has a bunch of type parameters that help identify how the broadcasting should be done. Specifically, it's Broadcasted{Style<:Union{Nothing,BroadcastStyle}, ElType, Axes, Indexing<:Union{Nothing,TupleLL}, F, Args<:TupleLL}. As you can see in the above type, lots of those parameters are Broadcast.Unknown or Nothing. They haven't been decided yet — that happens with the instantiate function:

julia> bc′ = instantiate(bc)
       typeof(bc′)
Base.Broadcast.Broadcasted{
    #= Style  =# Base.Broadcast.DefaultArrayStyle{1},
    #= ElType =# Float64,
    #= Axes   =# Tuple{Base.OneTo{Int64}},
    #= Idxing =# Base.TupleLL{Tuple{Tuple{},Tuple{}},Base.TupleLL{Tuple{Tuple{Bool},Tuple{Int64}},Base.TupleLL{Tuple{Tuple{Bool},Tuple{Int64}},Base.TupleLLEnd}}},
    #= F      =# typeof(+),
    #= Args   =# Base.TupleLL{Int64,Base.TupleLL{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Int64,Tuple{Base.OneTo{Int64}},Base.TupleLL{Tuple{Tuple{},Tuple{}},Base.TupleLL{Tuple{Tuple{Bool},Tuple{Int64}},Base.TupleLLEnd}},typeof(*),Base.TupleLL{Int64,Base.TupleLL{UnitRange{Int64},Base.TupleLLEnd}}},Base.TupleLL{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1},Float64,Tuple{Base.OneTo{Int64}},Base.TupleLL{Tuple{Tuple{Bool},Tuple{Int64}},Base.TupleLLEnd},typeof(sin),Base.TupleLL{UnitRange{Int64},Base.TupleLLEnd}},Base.TupleLLEnd}}}
}

Aha, ok, now we can see that it's using the DefaultArrayStyle{1}, with an eltype of Float64. The last four parameters simply encode the types of the four fields of the struct — those are more easily understood via their values:

julia> bc′.axes
(Base.OneTo(10),)

julia> bc′.indexing
{((), ()), ((true,), (1,)), ((true,), (1,))}

julia> bc′.f
+ (generic function with 177 methods)

julia> bc′.args
{3, Broadcasted(*, 2, 1:10), Broadcasted(sin, 1:10)}

Makes sense — we have f and its arguments, as well as information about the resulting axes and how to index into each of the three arguments (I think). That funny thing that prints as a {} list? That's a TupleLL — a cons-list representation of a tuple-like object. Why not just use regular tuples? We want this to be type-stable for computations involving types, so it needs to store the specific Type{T} information for each type (and not just a DataType). It's not quite clear to me why the indexing property needs to be a TupleLL yet.
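
The trick can be sketched with a hand-rolled cons list (an illustration of the idea, not Base's actual `TupleLL` definition):

```julia
struct ConsList{T, Rest}
    head::T
    rest::Rest
end
# Extra constructor: when the head is itself a type, record it as Type{T}
# in the node's type rather than letting it collapse to DataType.
ConsList(::Type{T}, rest) where {T} = ConsList{Type{T}, typeof(rest)}(T, rest)

typeof(ConsList(1:10, nothing))
# ConsList{UnitRange{Int64}, Nothing}
typeof(ConsList(Float64, ConsList(1:10, nothing)))
# ConsList{Type{Float64}, ConsList{UnitRange{Int64}, Nothing}}

# Compare with a plain tuple, which only remembers DataType:
typeof((Float64, 1:10))   # Tuple{DataType, UnitRange{Int64}}
```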

Incremental results

If I remove the sin(x) from the above broadcast, we end up with a nearly identical broadcast, but once we copy it (by way of execute) we get a range. Hooray!

julia> bc = instantiate(make(+, 3, make(*, 2, x)))
Broadcasted(+, 3, Broadcasted(*, 2, 1:10))

julia> copy(bc)
5:2:23

This occurs because of a hook that checks is_broadcast_incremental before allocating the return array and copyto!ing into it. For this particular bc object, that check returns true because it's a DefaultArrayStyle{1} and the hook recursively walks through all the wrapped functions to determine that they're all known (probably) linear operations. It's not an exact calculation, though, since *(::Scalar, ::Range) is linear but *(::Range, ::Range) isn't. If the check returns true, the broadcast machinery recursively walks through the entire Broadcasted object and explicitly calls broadcast(f, args...) for the leaf-most nodes. After each leaf, though, it "tries again": it re-runs the whole execute machinery with these new intermediate results, checking once more whether the result is still is_broadcast_incremental. This leads to the one temporary Tim mentions above when a complicated fused broadcast makes the maybe-linear-operation check guess wrong.
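
A toy version of that check might look roughly like this (purely illustrative; it pretends `bc.args` is an ordinary tuple, whereas here it's a `TupleLL`, and the real `is_broadcast_incremental` is considerably more careful):

```julia
using Base.Broadcast: Broadcasted

# Functions we guess are range-preserving.  As noted above the guess can be
# wrong: * only preserves ranges when one of its arguments is a scalar.
const probably_range_preserving = (+, -, *)

maybe_incremental(x) = true      # leaves (numbers, ranges) are fine as-is
maybe_incremental(bc::Broadcasted) =
    bc.f in probably_range_preserving && all(maybe_incremental, bc.args)
```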

longest_tuple(A, B, Bs...) = longest_tuple(Bs...)
longest_tuple(A::Tuple) = A
@simd for I in CartesianIndices(axes(bc))
@inbounds dest[I] = _broadcast_getindex(bc, I)
Member

Nifty that you can index directly into the Broadcasted objects. Have you thought about making this a bit more first class? Like calling it simply getindex(::Broadcasted, …)? Or maybe even Broadcasted <: AbstractArray? Not crucial by any means. Just cool.

Member Author

Yes, I have, and I've gone back and forth on that. Currently I haven't done that for one somewhat arcane reason: in A .+ (v .+ 1) (A is a matrix and v a vector), the inner broadcasting operation with v doesn't have any objects that have 2d indices. Consequently it seems that bcinner[i, j] should throw a BoundsError if j>1, but _broadcast_getindex doesn't fall victim to that requirement.

Now, you might argue that because bcinner will typically be instantiated, the axes field indicates that really, it does have suitable indices. But not all Broadcasted types get instantiated (e.g., for StaticArrays one may want to short-circuit the axes field computation). Since in such cases there's no hint that bcinner should be supported over a larger domain, I decided not to call it getindex.

Like I said, pretty arcane.

Member

Could we define getindex just on Broadcasted objects that have instantiated indexers? The main reason to consider this is that it provides a very nice public & stable API for working with Broadcasted objects. Folks will need to handle them in their copyto! methods, but it'd be nice to be able to draw some clear lines within Base.Broadcast on what's public and what's private.

How are you imagining StaticArrays would use this? Broadcast.flatten?

broadcasting by preprocessing the arguments to potentially wrap them with indexing helpers.
* Slightly clearer recursion through arg lists in not_nested
* Move show(::IO, ::Broadcasted) to a more sensible location and have it print its type fully qualified with the `Style` parameter.
* origin/master: (23 commits)
  fix deprecations of \cdot and \times (#26884)
  Support reshaping custom 0-dimensional arrays (#26870)
  fix some cases of dot syntax lowering (#26878)
  Pkg3: deterministically close the LibGit2 repo in tests (#26883)
  code loading docs: add missing graph edge (#26874)
  add news for #26858 and #26859 [ci skip] (#26869)
  Deprecate using && and || within at-dot expressions (#26792)
  widen `Int8` and `Int16` to `Int` instead of `Int32` (#26859)
  fix #26038, make `isequal` consistent with `hash` for `Ptr` (#26858)
  Deprecate variadic size(A, dim1, dim2, dims...) method (#26862)
  add using Random to example in manual (#26864)
  warn once instead of depwarn since we want to test it
  Revert "reserve syntax that could be used for computed field types (#18466) (#26816)" (#26857)
  Fix compilation on LLVM 6.0
  change promotion behaviour of `cumsum` and `cumsum!` to match `sum`
  [LLVM 6] add patch to diamond if-conversion
  add a precompile command that can be used to precompile all dependencies (#254)
  use registry if no version entry exist in project for developed pacakges
  make Pkg3 work as a drop in for the old CI scripts
  update registries when adding (#253)
  ...
@mbauman
Member

mbauman commented Apr 23, 2018

Ok, I think I have the performance as good as I can make it for now. There are still two classes of regressions:

  • A failure to emit vectorized instructions in a place where we would have vectorized before. Given f(r, x) = r .= x.*x.*x.*x, code_llvm(f, Tuple{Vector{Float64}, Vector{Float64}}) emits SIMD vectorized instructions on master but not on this branch. This results in a 1.5x performance degradation in this example and is responsible for the 1.85x degradation in the dotop test Nanosoldier flagged. I am not able to understand why LLVM isn't vectorizing this on my branch.

  • A higher cost to broadcasting over non-type-stable functions or non-concretely-typed collections (including small discriminated unions). This is responsible for the 4-8x regressions in the union tests. I've been able to mitigate this somewhat by performing the same sort of preprocessing as we do in copyto! — that is, precomputing indexing helpers and wrapping arguments as necessary — but I'm still seeing 3x-6x regressions locally. I had hoped the new optimizer would prove more willing to work with the resulting Union{Vector{T}, Vector{S}, Vector{Union{T,S}}}, but so far my rudimentary tests with enable_new_optimizer have shown the same performance.

Everything else here is looking great. Given the many issues this resolves I'm inclined to merge with those two items outstanding. Given the messy history, I'll rebase into one well-documented squashed commit that attributes all four authors here as coauthors. I'll leave this PR as is — I believe the history here is useful — and create a new PR.

@andreasnoack
Member

Do we have any quantifications of the compile time consequences of this change?

@jekbradbury
Contributor

What about your earlier thought that "we could just use axes"? Does following that through to its conclusion lead to unexpected issues?

@mbauman
Member

mbauman commented Apr 23, 2018

Do we have any quantifications of the compile time consequences of this change?

Timing that right now.

What about your earlier thought that "we could just use axes"? Does following that through to its conclusion lead to unexpected issues?

We have two types that we use heavily in broadcast that don't define axes (and are thus "patched up" in broadcast_axes): Ref and Tuple. I'd like to define axes directly on both of those, but it seems like a completely orthogonal change that should be tackled independently. So for the time being, broadcast_axes remains.
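
For reference, the special-casing being described amounts to roughly this (a simplified sketch with a renamed function, not the actual `broadcast_axes` definitions):

```julia
# Arrays, numbers, etc. already define `axes`; for broadcasting purposes a Ref
# acts 0-dimensional and a Tuple acts 1-dimensional.
my_broadcast_axes(x) = axes(x)
my_broadcast_axes(::Ref) = ()
my_broadcast_axes(t::Tuple) = (Base.OneTo(length(t)),)

my_broadcast_axes([1 2; 3 4])   # (Base.OneTo(2), Base.OneTo(2))
my_broadcast_axes(Ref(1))       # ()
my_broadcast_axes((1, 2, 3))    # (Base.OneTo(3),)
```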

NEWS.md Outdated
`AbstractDict`, `AbstractString`, `Tuple` and `NamedTuple` objects ([#24774]).
In particular, this means that it returns `CartesianIndex` objects for matrices
and higher-dimensional arrays instead of linear indices as was previously the case.
Use `Int[LinearIndices(size(a))[i] for i in find(f, a)]` to compute linear indices.
Member

This looks redundant with the next news item (and partly contradicts it).

Member

Ah, thanks for the reminder. This was a merge mistake.

@mbauman
Member

mbauman commented Apr 23, 2018

Pseudo-compile-time tests:

| test | this PR (110a0a5) | master (692408c) |
| --- | --- | --- |
| `make && touch base/version.jl && time make` | 4m26.788s | 4m33.808s |
| `time ./julia stdlib/LinearAlgebra/test/triangular.jl` | 6m39.415s | 6m36.216s |
| `time ./julia -e 'using Random, Test; include("test/broadcast.jl")'` | 0m40.090s | 0m38.218s |
| `time ./julia stdlib/SparseArrays/test/higherorderfns.jl` | 1m57.239s | 1m40.889s |
| `time ./julia stdlib/SparseArrays/test/sparsevector.jl` | 2m33.843s | 2m34.833s |
| `time ./julia stdlib/LinearAlgebra/test/dense.jl` | 2m30.039s | 2m22.954s |
| perf_op_bcast (1) | 0.248796 | 0.034302 |
| perf_op_bcast (2) | 0.051545 | 0.045142 |

The perf_op_bcast rows are the results of the interactive julia session below, numbered sequentially. I believe the regression in (1) simply reflects the lack of the appropriate precompile statements.

julia> perf_op_bcast!(r, x) = r .= 3 .* x .- 4 .* x.^2 .+ x .* x .- x .^ 3;

julia> @time perf_op_bcast!([1],[1])
  0.034302 seconds (43.61 k allocations: 2.416 MiB)
1-element Array{Int64,1}:
 -1

julia> perf_op_bcast!(r, x) = r .= 3 .* x .- 4 .* x.^2 .+ x .* x .- x .^ 3;

julia> @time perf_op_bcast!([1],[1])
  0.045142 seconds (39.13 k allocations: 2.140 MiB)
1-element Array{Int64,1}:
 -1

mbauman added a commit that referenced this pull request Apr 23, 2018
This patch represents the combined efforts of four individuals, over 60
commits, and an iterated design over (at least) three pull requests that
spanned nearly an entire year (closes #22063, #23692, #25377 by superseding
them).

This introduces a pure Julia data structure that represents a fused broadcast
expression.  For example, the expression `2 .* (x .+ 1)` lowers to:

```julia
julia> Meta.@lower 2 .* (x .+ 1)
:($(Expr(:thunk, CodeInfo(:(begin
      Core.SSAValue(0) = (Base.getproperty)(Base.Broadcast, :materialize)
      Core.SSAValue(1) = (Base.getproperty)(Base.Broadcast, :make)
      Core.SSAValue(2) = (Base.getproperty)(Base.Broadcast, :make)
      Core.SSAValue(3) = (Core.SSAValue(2))(+, x, 1)
      Core.SSAValue(4) = (Core.SSAValue(1))(*, 2, Core.SSAValue(3))
      Core.SSAValue(5) = (Core.SSAValue(0))(Core.SSAValue(4))
      return Core.SSAValue(5)
  end)))))
```

Or, slightly more readably as:

```julia
using .Broadcast: materialize, make
materialize(make(*, 2, make(+, x, 1)))
```

The `Broadcast.make` function serves two purposes. Its primary purpose is to
construct the `Broadcast.Broadcasted` objects that hold onto the function, the
tuple of arguments (potentially including nested `Broadcasted` arguments), and
sometimes a set of `axes` to include knowledge of the outer shape. The
secondary purpose, however, is to allow an "out" for objects that _don't_ want
to participate in fusion. For example, if `x` is a range in the above `2 .* (x
.+ 1)` expression, it needn't allocate an array and operate elementwise — it
can just compute and return a new range. Thus custom structures are able to
specialize `Broadcast.make(f, args...)` just as they'd specialize on `f`
normally to return an immediate result.

`Broadcast.materialize` is identity for everything _except_ `Broadcasted`
objects for which it allocates an appropriate result and computes the
broadcast. It does two things: it `instantiate`s the outermost `Broadcasted`
object to compute its axes and then `copy`s it.

Similarly, an in-place fused broadcast like `y .= 2 .* (x .+ 1)` uses the exact
same expression tree to compute the right-hand side of the expression as above,
and then uses `materialize!(y, make(*, 2, make(+, x, 1)))` to `instantiate` the
`Broadcasted` expression tree and then `copyto!` it into the given destination.

Altogether, this forms a complete API for custom types to extend and
customize the behavior of broadcast (fixes #22060). It uses the existing
`BroadcastStyle`s throughout to simplify dispatch on many arguments:

* Custom types can opt out of broadcast fusion by specializing
  `Broadcast.make(f, args...)` or `Broadcast.make(::BroadcastStyle, f, args...)`.

* The `Broadcasted` object computes and stores the type of the combined
  `BroadcastStyle` of its arguments as its first type parameter, allowing for
  easy dispatch and specialization.

* Custom Broadcast storage is still allocated via `broadcast_similar`, however
  instead of passing just a function as a first argument, the entire
  `Broadcasted` object is passed as a final argument. This potentially allows
  for much more runtime specialization dependent upon the exact expression
  given.

* Custom broadcast implementations for a `CustomStyle` are defined by
  specializing `copy(bc::Broadcasted{CustomStyle})` or
  `copyto!(dest::AbstractArray, bc::Broadcasted{CustomStyle})`.

* Fallback broadcast specializations for a given output object of type `Dest`
  (for the `DefaultArrayStyle` or another such style that hasn't implemented
  assignments into such an object) are defined by specializing
  `copyto!(dest::Dest, bc::Broadcasted{Nothing})`.

As it fully supports range broadcasting, this now deprecates `(1:5) + 2` to
`.+`, just as had been done for all `AbstractArray`s in general.

As a first-mover proof of concept, LinearAlgebra uses this new system to
improve broadcasting over structured arrays. Before, broadcasting over a
structured matrix would result in a sparse array. Now, broadcasting over a
structured matrix will _either_ return an appropriately structured matrix _or_
a dense array. This does incur a type instability (in the form of a
discriminated union) in some situations, but thanks to type-based introspection
of the `Broadcasted` wrapper commonly used functions can be special cased to be
type stable.  For example:

```julia
julia> f(d) = round.(Int, d)
f (generic function with 1 method)

julia> @inferred f(Diagonal(rand(3)))
3×3 Diagonal{Int64,Array{Int64,1}}:
 0  ⋅  ⋅
 ⋅  0  ⋅
 ⋅  ⋅  1

julia> @inferred Diagonal(rand(3)) .* 3
ERROR: return type Diagonal{Float64,Array{Float64,1}} does not match inferred return type Union{Array{Float64,2}, Diagonal{Float64,Array{Float64,1}}}
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] top-level scope

julia> @inferred Diagonal(1:4) .+ Bidiagonal(rand(4), rand(3), 'U') .* Tridiagonal(1:3, 1:4, 1:3)
4×4 Tridiagonal{Float64,Array{Float64,1}}:
 1.30771  0.838589   ⋅          ⋅
 0.0      3.89109   0.0459757   ⋅
  ⋅       0.0       4.48033    2.51508
  ⋅        ⋅        0.0        6.23739
```

In addition to the issues referenced above, it fixes:

* Fixes #19313, #22053, #23445, and #24586: Literals are no longer treated
  specially in a fused broadcast; they're just arguments in a `Broadcasted`
  object like everything else.

* Fixes #21094: Since broadcasting is now represented by a pure Julia
  datastructure it can be created within `@generated` functions and serialized.

* Fixes #26097: The fallback destination-array specialization method of
  `copyto!` is specifically implemented as `Broadcasted{Nothing}` and will not
  be confused by `nothing` arguments.

* Fixes the broadcast-specific element of #25499: The default base broadcast
  implementation no longer depends upon `Base._return_type` to allocate its
  array (except in the empty or concretely-typed cases). Note that the sparse
  implementation (#19595) is still dependent upon inference and is _not_ fixed.

* Fixes #25340: Functions are treated like normal values just like arguments
  and only evaluated once.

* Fixes #22255, and is performant with 12+ fused broadcasts. Okay, that one was
  fixed on master already, but this fixes it now, too.

* Fixes #25521.

* The performance of this patch has been thoroughly tested through its
  iterative development process in #25377. There remain [two classes of
  performance regressions](#25377) that Nanosoldier flagged.

* #25691: Constant literals still lose their constant-ness upon going
  through the broadcast machinery. I believe quite a large number of
  functions would need to be marked as `@pure` to support this -- including
  functions that are intended to be specialized.

(For bookkeeping, this is the squashed version of the [teh-jn/lazydotfuse](#25377)
branch as of a1d4e7e. Squashed and separated
out to make it easier to review and commit)

Co-authored-by: Tim Holy <tim.holy@gmail.com>
Co-authored-by: Jameson Nash <vtjnash@gmail.com>
Co-authored-by: Andrew Keller <ajkeller34@users.noreply.github.com>
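
As a concrete reading of the `make` opt-out described in the commit message above, a hypothetical range-like type could bypass fusion roughly like this (illustrative only; `EveryOther` is invented for the example, and a real package would likely also handle the style-taking form of `make`):

```julia
# A made-up lazily-representable container: start, start+2, start+4, … (n values).
struct EveryOther
    start::Int
    n::Int
end

# Opt out of fusion for `r .+ k`: instead of returning a lazy Broadcasted node,
# `make` computes and returns the final result immediately, so the surrounding
# fused expression sees an ordinary argument rather than a nested node.
Base.Broadcast.make(::typeof(+), r::EveryOther, k::Int) = EveryOther(r.start + k, r.n)
```
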
@JeffBezanson
Member

Should this be closed?

@mbauman
Member

mbauman commented Apr 24, 2018

Sure, let's make sure discussion is consolidated in the new PR: #26891.

@mbauman closed this Apr 24, 2018
@mbauman added the `broadcast` (Applying a function over a collection) label Apr 24, 2018
@ararslan deleted the teh-jn/lazydotfuse branch April 25, 2018 01:13
Keno pushed a commit that referenced this pull request Apr 27, 2018
andreasnoack pushed a commit to JuliaLinearAlgebra/SparseLinearAlgebra.jl that referenced this pull request Jun 21, 2018
nalimilan pushed a commit to JuliaStats/Statistics.jl that referenced this pull request Sep 18, 2019
nalimilan pushed a commit to JuliaStats/Statistics.jl that referenced this pull request Sep 18, 2019
KristofferC pushed a commit to JuliaSparse/SparseArrays.jl that referenced this pull request Nov 2, 2021
thchr added a commit to thchr/StaticArrays.jl that referenced this pull request Mar 10, 2022
- referenced non-extant `indices`
- `broadcasted_indices` was deprecated to `broadcasted_axes` in JuliaLang/julia#25377
- `broadcasted_axes` was then replaced by plain `axes` in JuliaLang/julia@a2feccf
- StaticArrays already has `axes` methods