better internal interface for extending broadcast #22060

Closed
stevengj opened this issue May 25, 2017 · 16 comments · Fixed by #26891
Labels: broadcast (Applying a function over a collection), gpu (Affects running Julia on a GPU), speculative (Whether the change will be implemented is speculative)

Comments

@stevengj (Member) commented May 25, 2017

@MikeInnes suggested in FluxML/Flux.jl#31 that we change the dot-call lowering in such a way as to allow it to be disabled for custom types (which may reside on a GPU or something and support only a small set of functions).

For example, currently x .+ y .* x.^z is lowered to broadcast((x,y,z) -> x+y*x^z, x,y,z). This could instead be lowered to:

if Base.isfusing(x,y,z)
    broadcast((x,y,z) -> x+y*x^z, x, y, z)
else
    broadcast(+, x, broadcast(*, y, broadcast(^, x, z)))
end

with Base.isfusing(...) = true being the default. This would also address #22053 (cc @malmaud).
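
As a rough sketch of how a package might use these hooks if they were adopted (purely hypothetical, since this lowering was never merged; GPUTensor, gpu_add, and gpu_mul are made-up names):

# Hypothetical opt-out for a GPU container (a real rule would inspect every operand):
Base.isfusing(x::GPUTensor, rest...) = false

# Under the non-fused fallback only pairwise broadcasts occur, so a handful of
# methods covers the supported element-wise kernels:
Base.broadcast(::typeof(+), x::GPUTensor, y::GPUTensor) = gpu_add(x, y)
Base.broadcast(::typeof(*), x::GPUTensor, y::GPUTensor) = gpu_mul(x, y)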

Pro:

  • Makes life easier in the short run for containers based on external libraries that only support a few efficient operations. They could overload isfusing and broadcast(::typeof(foo), ...) for a small number of supported foo.

  • Makes life easier in specialized applications where it may not be possible to define broadcast for general functions, e.g. in Convex.jl where it needs to guarantee convexity.

Con:

  • You can no longer look at an expression like x .+ y .* x and know that it fuses into a single loop with no temporaries.

  • There is no middle ground. A single non-fusable operand will "spoil" fusion for an entire set of nested dot calls. (To fuse a subset of an expression, you'd have to explicitly assign it to a temporary array.) (We could lower to a nested set of isfusing calls, of course, but then you'd get an exponential explosion in lowered code size.)

In the long run, I really think that packages like TensorFlow.jl should exploit the underlying library's ability to define custom operations as callback functions to implement broadcast(f::Function, ...) for arbitrary f, at which point they can define isfusing(...) = true and get all the benefits of fusion.

stevengj added the speculative and broadcast labels May 25, 2017
@malmaud (Contributor) commented May 25, 2017

This seems like a good solution.

It reminds me of the solution we had for package precompilation: one package that doesn't enable precompilation prevents all its dependents from precompiling, just like one non-fusing operand spoils a whole call. That put pressure on package authors to support precompilation, but gave them some time and flexibility to do so and preserved backwards compatibility. This solution gives us a similar path forward for supporting fusion across the Julia ecosystem.

@nalimilan (Member) commented

That would also fix #19313 (though I'm less interested in it since Nullable is going to be replaced with Union{T, Null} to represent missingness).

@oxinabox (Contributor) commented

Am I right in saying that the branch in the lowered code would be optimized away during specialization on input types, if it's used within a function where x, y, z are type-stable?

@stevengj (Member, Author) commented

@oxinabox, yes.

@ylxdzsw (Contributor) commented May 26, 2017

Couldn't we do the same thing without this feature? For example, if we want to overload broadcasting for a Tensor type's element-wise operations:

# Sketch: `Tensor`, `element_wise_add`, and `element_wise_mult` are assumed to
# be provided by the wrapping library.
struct Wrapper{T}
    data::T
end

Base.broadcast(f, xs::Tensor...) = f(map(Wrapper, xs)...).data

Base.:+(x::Wrapper{Tensor}, y::Wrapper{Tensor}) = element_wise_add(x.data, y.data) |> Wrapper
Base.:*(x::Wrapper{Tensor}, y::Wrapper{Tensor}) = element_wise_mult(x.data, y.data) |> Wrapper

So for x .+ y .* z, f becomes (x,y,z) -> x + y*z and is called with the Wrappers. Then + and * also get called with Wrappers, so in the end the actual Tensors are handled by element_wise_mult and element_wise_add.

@stevengj (Member, Author) commented

@ylxdzsw, good trick! That might well work.

@KristofferC (Member) commented

Could that be seen as "lifting" the arrays so they are considered scalars from the point of view of broadcast, defining the operations on those scalars, and then extracting the arrays again?

@MikeInnes (Member) commented

It's a neat trick but I'm not sure how it will go in practice; at the least it's going to have an impact on error messages and traces.

How would you get multiple types (which could each be fusing or not) to play nicely together? It seems like you'd need some standard Wrapper type to do the dispatch on, and some way of marking what should be wrapped before broadcast – which devolves pretty quickly into a more complicated version of this approach :)

@stevengj (Member, Author) commented May 26, 2017

@MikeInnes, it is more complicated, but I think it could be done entirely with dispatch, by defining a broadcast(f, ::Union{Tensor,X,Y}...) method for all X and Y that your library supports broadcasting tensors with (presumably only a small number), rather than by changes to lowering.
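
A rough sketch of that dispatch-only variant (Tensor and Wrapper as in the earlier comment; X and Y are placeholders for the few extra types the library supports):

# Wrap every argument so the fused function f hits the operator overloads on
# Wrapper, then unwrap the resulting tensor.
function Base.broadcast(f, args::Union{Tensor, X, Y}...)
    f(map(Wrapper, args)...).data
end
# A real definition would also need to require at least one Tensor argument so
# that broadcasts involving only X and Y are not hijacked.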

@vchuravy (Member) commented

The problem with that is that we are talking about a diverse set of array-like containers that come from different libraries. I am definitely in favour of the trait-like implementation.

@oxinabox (Contributor) commented

> The problem with that is that we are talking about a diverse set of array-like containers that come from different libraries. I am definitely in favour of the trait-like implementation.

Indeed, even calling some of them containers is misleading.
E.g., TensorFlow's Tensors do not hold values as such;
they describe a computational graph in terms of tensors.
When that graph is executed, the actual containers are filled in the C++ backend.
I guess "objects with array-like semantics" is a bit of a mouthful, though.

I'm interested in @ylxdzsw's "trick"; I think we would need to try it out to get clarity.

@MikeInnes (Member) commented

cc @madeleineudell

@MikeInnes (Member) commented Jun 23, 2017

also @denizyuret

This came up quite a few times at JuliaCon so I put something together to hack around it. Please try it out and let me know how it goes.

I have to say, I'm pretty disappointed that Base has left a bunch of packages out in the cold like this. I hope it's worth it to save some characters on broadcast! in a few places.

@malmaud (Contributor) commented Aug 30, 2017

Can we add this to the 1.0 milestone?

StefanKarpinski added this to the 1.0 milestone Aug 31, 2017
@StefanKarpinski (Member) commented

While I've added this to the 1.0 milestone, we may want to consider the internals of how dot-fusion syntax works to be unstable.

JeffBezanson changed the title from "trait-like rule to disable broadcast fusion" to "better internal interface for extending broadcast" Nov 20, 2017
@JeffBezanson (Member) commented

We've decided that broadcast internals can change over 1.x.

JeffBezanson changed the milestone from 1.0 to 1.x Nov 20, 2017
mbauman added a commit that referenced this issue Apr 23, 2018
This patch represents the combined efforts of four individuals, over 60
commits, and an iterated design over (at least) three pull requests that
spanned nearly an entire year (closes #22063, #23692, #25377 by superseding
them).

This introduces a pure Julia data structure that represents a fused broadcast
expression.  For example, the expression `2 .* (x .+ 1)` lowers to:

```julia
julia> Meta.@lower 2 .* (x .+ 1)
:($(Expr(:thunk, CodeInfo(:(begin
      Core.SSAValue(0) = (Base.getproperty)(Base.Broadcast, :materialize)
      Core.SSAValue(1) = (Base.getproperty)(Base.Broadcast, :make)
      Core.SSAValue(2) = (Base.getproperty)(Base.Broadcast, :make)
      Core.SSAValue(3) = (Core.SSAValue(2))(+, x, 1)
      Core.SSAValue(4) = (Core.SSAValue(1))(*, 2, Core.SSAValue(3))
      Core.SSAValue(5) = (Core.SSAValue(0))(Core.SSAValue(4))
      return Core.SSAValue(5)
  end)))))
```

Or, slightly more readably as:

```julia
using .Broadcast: materialize, make
materialize(make(*, 2, make(+, x, 1)))
```

The `Broadcast.make` function serves two purposes. Its primary purpose is to
construct the `Broadcast.Broadcasted` objects that hold onto the function, the
tuple of arguments (potentially including nested `Broadcasted` arguments), and
sometimes a set of `axes` to include knowledge of the outer shape. The
secondary purpose, however, is to allow an "out" for objects that _don't_ want
to participate in fusion. For example, if `x` is a range in the above `2 .* (x
.+ 1)` expression, it needn't allocate an array and operate elementwise — it
can just compute and return a new range. Thus custom structures are able to
specialize `Broadcast.make(f, args...)` just as they'd specialize on `f`
normally to return an immediate result.
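
For instance, a minimal sketch of such an opt-out, using the `Broadcast.make`
spelling from this patch (the released 1.x hook is `Broadcast.broadcasted`);
the rule below is illustrative rather than the actual Base definition:

```julia
# Eagerly shift a unit range instead of building a lazy `Broadcasted` tree,
# so the inner `.+` in `2 .* ((1:5) .+ 1)` simply returns `2:6`.
Broadcast.make(::typeof(+), r::UnitRange, x::Integer) = (first(r) + x):(last(r) + x)
```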

`Broadcast.materialize` is identity for everything _except_ `Broadcasted`
objects for which it allocates an appropriate result and computes the
broadcast. It does two things: it `initialize`s the outermost `Broadcasted`
object to compute its axes and then `copy`s it.
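
For example (again with the `make` spelling used in this patch):

```julia
Broadcast.materialize(42)              # not a Broadcasted: returned as-is
bc = Broadcast.make(+, [1, 2, 3], 10)  # lazy Broadcasted expression tree
Broadcast.materialize(bc)              # allocates and computes: [11, 12, 13]
```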

Similarly, an in-place fused broadcast like `y .= 2 .* (x .+ 1)` uses the exact
same expression tree to compute the right-hand side of the expression as above,
and then uses `materialize!(y, make(*, 2, make(+, x, 1)))` to `instantiate` the
`Broadcasted` expression tree and then `copyto!` it into the given destination.

Altogether, this forms a complete API for custom types to extend and
customize the behavior of broadcast (fixes #22060). It uses the existing
`BroadcastStyle`s throughout to simplify dispatch on many arguments:

* Custom types can opt-out of broadcast fusion by specializing
  `Broadcast.make(f, args...)` or `Broadcast.make(::BroadcastStyle, f, args...)`.

* The `Broadcasted` object computes and stores the type of the combined
  `BroadcastStyle` of its arguments as its first type parameter, allowing for
  easy dispatch and specialization.

* Custom Broadcast storage is still allocated via `broadcast_similar`; however,
  instead of passing just a function as a first argument, the entire
  `Broadcasted` object is passed as a final argument. This potentially allows
  for much more runtime specialization dependent upon the exact expression
  given.

* Custom broadcast implementations for a `CustomStyle` are defined by
  specializing `copy(bc::Broadcasted{CustomStyle})` or
  `copyto!(dest::AbstractArray, bc::Broadcasted{CustomStyle})`; a minimal
  sketch follows this list.

* Fallback broadcast specializations for a given output object of type `Dest`
  (for the `DefaultArrayStyle` or another such style that hasn't implemented
  assignments into such an object) are defined by specializing
  `copyto!(dest::Dest, bc::Broadcasted{Nothing})`.
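
A minimal sketch of the `copy` extension point, written with the released 1.x
names (`MyArray` and `MyStyle` are hypothetical; a real package would dispatch
to its own kernels instead of falling back to the dense machinery):

```julia
struct MyStyle <: Broadcast.BroadcastStyle end
Base.BroadcastStyle(::Type{<:MyArray}) = MyStyle()
# Let MyStyle win when mixed with ordinary arrays:
Base.BroadcastStyle(::MyStyle, ::Broadcast.DefaultArrayStyle) = MyStyle()

function Base.copy(bc::Broadcast.Broadcasted{MyStyle})
    # Re-run the fused expression through the default dense machinery,
    # then wrap the result back up in a MyArray.
    inner = Broadcast.Broadcasted{Broadcast.DefaultArrayStyle{1}}(bc.f, bc.args, bc.axes)
    MyArray(copy(Broadcast.instantiate(inner)))
end
```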

As it fully supports range broadcasting, this now deprecates `(1:5) + 2` in
favor of `(1:5) .+ 2`, just as had been done for all `AbstractArray`s in general.

As a first-mover proof of concept, LinearAlgebra uses this new system to
improve broadcasting over structured arrays. Before, broadcasting over a
structured matrix would result in a sparse array. Now, broadcasting over a
structured matrix will _either_ return an appropriately structured matrix _or_
a dense array. This does incur a type instability (in the form of a
discriminated union) in some situations, but thanks to type-based introspection
of the `Broadcasted` wrapper, commonly used functions can be special-cased to be
type stable.  For example:

```julia
julia> f(d) = round.(Int, d)
f (generic function with 1 method)

julia> @inferred f(Diagonal(rand(3)))
3×3 Diagonal{Int64,Array{Int64,1}}:
 0  ⋅  ⋅
 ⋅  0  ⋅
 ⋅  ⋅  1

julia> @inferred Diagonal(rand(3)) .* 3
ERROR: return type Diagonal{Float64,Array{Float64,1}} does not match inferred return type Union{Array{Float64,2}, Diagonal{Float64,Array{Float64,1}}}
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] top-level scope

julia> @inferred Diagonal(1:4) .+ Bidiagonal(rand(4), rand(3), 'U') .* Tridiagonal(1:3, 1:4, 1:3)
4×4 Tridiagonal{Float64,Array{Float64,1}}:
 1.30771  0.838589   ⋅          ⋅
 0.0      3.89109   0.0459757   ⋅
  ⋅       0.0       4.48033    2.51508
  ⋅        ⋅        0.0        6.23739
```

In addition to the issues referenced above, it fixes:

* Fixes #19313, #22053, #23445, and #24586: Literals are no longer treated
  specially in a fused broadcast; they're just arguments in a `Broadcasted`
  object like everything else.

* Fixes #21094: Since broadcasting is now represented by a pure Julia
  data structure, it can be created within `@generated` functions and serialized.

* Fixes #26097: The fallback destination-array specialization method of
  `copyto!` is specifically implemented as `Broadcasted{Nothing}` and will not
  be confused by `nothing` arguments.

* Fixes the broadcast-specific element of #25499: The default base broadcast
  implementation no longer depends upon `Base._return_type` to allocate its
  array (except in the empty or concretely-typed cases). Note that the sparse
  implementation (#19595) is still dependent upon inference and is _not_ fixed.

* Fixes #25340: Functions are treated like normal values, just like the other
  arguments, and are only evaluated once.

* Fixes #22255, and is performant with 12+ fused broadcasts. Okay, that one was
  fixed on master already, but this fixes it now, too.

* Fixes #25521.

* The performance of this patch has been thoroughly tested through its
  iterative development process in #25377. There remain [two classes of
  performance regressions](#25377) that Nanosoldier flagged.

* #25691: Propagated constant literals still lose their constant-ness upon
  going through the broadcast machinery. I believe quite a large number of
  functions would need to be marked as `@pure` to support this, including
  functions that are intended to be specialized.

(For bookkeeping, this is the squashed version of the [teh-jn/lazydotfuse](#25377)
branch as of a1d4e7e. Squashed and separated
out to make it easier to review and commit)

Co-authored-by: Tim Holy <tim.holy@gmail.com>
Co-authored-by: Jameson Nash <vtjnash@gmail.com>
Co-authored-by: Andrew Keller <ajkeller34@users.noreply.github.com>
mbauman removed the broadcast label Apr 24, 2018
Keno pushed a commit that referenced this issue Apr 27, 2018 (with the same commit message as above).