
make "dot" operations (.+ etc) fusing broadcasts #17623

Merged · 4 commits · Dec 20, 2016

Conversation

stevengj (Member) commented Jul 26, 2016

This is a (formerly WIP, now finished) PR making dot operations into "fusing" broadcasts. That is, x .⨳ y (for any binary operator ⨳) is transformed by the parser into (⨳).(x, y), which in turn is fused with other nested "dot" calls into a single broadcast.
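For example (a schematic sketch of the lowering; the fused anonymous function is generated by the compiler, and its exact form depends on the literal-inlining and duplicate-argument optimizations discussed later in this thread), an expression such as

y = x .* 2 .+ sin.(x)

is parsed as (+).((*).(x, 2), sin.(x)) and then collapsed into a single call, roughly

y = broadcast(a -> a*2 + sin(a), x)

so the whole right-hand side is computed in one loop without temporaries.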

To do:

  • Parse x .⨳ y as a fusing "dot" function call.
  • Deprecate manual .⨳ method definitions to broadcast(::typeof(⨳), ...) definitions. (Currently, these methods are silently ignored.)
  • Unbreak the REPL. (Somehow this patch breaks it: typing backspace gives MethodError: no method matching splice!(::Array{UInt8,1}, ::Array{Int64,1}, ::Array{UInt8,1}).)
  • [true] .* [true] gives a stack overflow.
  • Fix lots of breaking tests caused by #16966 (broadcast should treat types as "scalar-like").
  • restore 0.5 behavior of Float64 * Array{Float32} = Array{Float32} etc. (for non-dot ops)
  • Fix breaking method-ambiguity test
  • More tests
  • Documentation
  • eliminate as many specialized broadcast(::typeof(op), ...) methods as possible

ViralBShah added this to the 0.5.x milestone Jul 26, 2016
tkelman (Contributor) commented Jul 26, 2016

This changes semantics and isn't a backport candidate.

stevengj (Member, Author) commented Aug 2, 2016

@JeffBezanson, I'm noticing an odd behavior with the compiler in the REPL. Basically, the first fused function I compile is slow, but the second and subsequent ones are fast.

In particular, the following f(x) = x .+ 3.*x.^3 .+ 4.*x.^2 is extremely slow:

x = rand(10^7);
f(x) = x .+ 3.*x.^3 .+ 4.*x.^2
@time f(x);
@time f(x);

reporting 40M allocations even on the second run. However, the same function is fast if I compile a different fused function first!

x = rand(10^7);
g(x) = x .+ 3.*x.^3 .- 4.*x.^2
@time g(x);
f(x) = x .+ 3.*x.^3 .+ 4.*x.^2
@time f(x);
@time f(x);

This reports only 8 allocations for f(x), and allocates exactly the expected amount of memory for the output array.

Any idea what could cause this? (I'll try to reproduce it in the master branch, to see if it affects the 0.5 loop fusion, and file a separate issue if that is the case.)

tkelman (Contributor) commented Aug 2, 2016

#17759 ?

stevengj (Member, Author) commented Aug 2, 2016

Ah, thanks @tkelman, that seems like the culprit. Using f(x) = x .+ 3.*x.*x.*x .+ 4.*x.*x, i.e. avoiding ^, eliminates the problem for me.

stevengj (Member, Author) commented Aug 2, 2016

As long as I avoid ^, whose problems seem orthogonal to this PR, the performance is exactly what I was hoping for. For example, y .= x .+ 3.*x.*x.*x .+ 4.*x.*x operates entirely in-place, with performance identical to writing out the loops manually (with @inbounds annotations).
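A minimal sketch of that comparison (the names f_fused! and f_loop! are just for illustration; run each @time twice to exclude compilation):

x = rand(10^7); y = similar(x);
f_fused!(y, x) = (y .= x .+ 3 .* x .* x .* x .+ 4 .* x .* x; y)
function f_loop!(y, x)
    @inbounds for i in eachindex(x)
        xi = x[i]
        y[i] = xi + 3*xi*xi*xi + 4*xi*xi   # same computation, written out by hand
    end
    return y
end
@time f_fused!(y, x);   # fused broadcast!: writes directly into y, no temporaries
@time f_loop!(y, x);    # should report essentially the same time and allocations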

@jrevels, it would be good to get some "vectorized operation" performance benchmarks on @nanosoldier.

@@ -865,7 +865,7 @@
 (begin
 #;(if (and (number? ex) (= ex 0))
 (error "juxtaposition with literal \"0\""))
-`(call * ,ex ,(parse-unary s))))
+`(call .* ,ex ,(parse-unary s))))
Contributor:

I'm not convinced this revised behaviour is a good thing. This could break things like 2x where x::Diagonal, which broadcast will try to promote to Array.

Member (Author):

2 .* x for x::Diagonal is also broken by this PR; why is breaking 2x worse?

(Such cases could be fixed by adding specialized broadcast methods, of course.)

Contributor:

* has different semantics from .*. I think it is a mistake to treat the latter as a superset of the former, as is being done with this implicit multiplication lowering to .*. Intuitively, 2x means 2 * x, not 2 .* x.
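As an illustration of where the two operations genuinely differ (matrix rather than scalar arguments):

A = [1 2; 3 4]; B = [5 6; 7 8];
A * B    # matrix multiplication:      [19 22; 43 50]
A .* B   # elementwise multiplication: [5 12; 21 32]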

Member (Author):

It doesn't have different semantics for multiplication by scalars...

Contributor:

But there is no guarantee that 2 .* x is the same as 2 * x for all types.

Also, this looks to be a performance disaster for the common scalar case.

julia> @code_llvm broadcast(*, 2, 2)

define %jl_value_t* @julia_broadcast_65524(%jl_value_t*, %jl_value_t**, i32) #0 {
top:
  %3 = alloca %jl_value_t**, align 8
  store volatile %jl_value_t** %1, %jl_value_t*** %3, align 8
  %4 = add i32 %2, -1
  %5 = icmp eq i32 %4, 0
  br i1 %5, label %fail, label %pass

fail:                                             ; preds = %top
  %6 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
  call void @jl_bounds_error_tuple_int(%jl_value_t** %6, i64 0, i64 1)
  unreachable

pass:                                             ; preds = %top
  %7 = icmp ugt i32 %4, 1
  br i1 %7, label %pass.2, label %fail1

fail1:                                            ; preds = %pass
  %8 = sext i32 %4 to i64
  %9 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
  call void @jl_bounds_error_tuple_int(%jl_value_t** %9, i64 %8, i64 2)
  unreachable

pass.2:                                           ; preds = %pass
  %10 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
  %11 = bitcast %jl_value_t** %10 to i64**
  %12 = load i64*, i64** %11, align 8
  %13 = load i64, i64* %12, align 16
  %14 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 2
  %15 = bitcast %jl_value_t** %14 to i64**
  %16 = load i64*, i64** %15, align 8
  %17 = load i64, i64* %16, align 16
  %18 = mul i64 %17, %13
  %19 = call %jl_value_t* @jl_box_int64(i64 signext %18)
  ret %jl_value_t* %19
}

I guess I am not understanding what is gained by this change. Loop fusion can be forced explicitly with .* anyway. Why should the scalar case be disrupted for the convenience of the vector case?

Member (Author):

Sure, it's not a big deal.

But why is it a performance disaster for the scalar case? Shouldn't it be getting inlined to be equivalent to 2 * 2?

Member (Author):

Performance looks good to me:

julia> f(x, y) = broadcast(*, x, y)
f (generic function with 1 method)

julia> @code_llvm f(2,2)

define i64 @julia_f_70407(i64, i64) #0 {
top:
  %2 = mul i64 %1, %0
  ret i64 %2
}

Member (Author):

Anyway, I'll revert this part of the PR, since it is controversial.

Contributor:

I wonder why @code_llvm on the broadcast itself is so scary.

Contributor:

jlcall (calling @code_llvm on broadcast directly shows the generic jlcall wrapper rather than the specialized code you get inside a function).

stevengj (Member, Author):

One of the difficulties I'm having with this PR is that it makes it effectively impossible to define specialized methods for .* etcetera.

For example, in Julia ≤ 0.5 we have specialized methods for array .< scalar that produce a BitArray, and we have specialized methods for sparsevector .* sparsevector that preserve the sparsity. With this PR, I can in principle overload broadcast(::typeof(<), array, scalar), but in fact it is almost useless.

The problem is that as soon as you fuse the operation with another dot call, it produces a fused anonymous function and the specialized broadcast method is never called. (The compress-fuse optimizations that inline numeric literals and combine duplicate broadcast arguments are another problem.)
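A schematic example of the problem (the fused function shown is roughly what the lowering produces, not literal compiler output):

# specialized method, e.g. to produce a BitArray:
#   Base.broadcast(::typeof(<), A::AbstractArray, x::Number) = ...
A .< 0.5                    # lone dot call: lowers to broadcast(<, A, 0.5) and hits the method
(A .< 0.5) .& (B .> 0.1)    # fused with another dot call: lowers to roughly
                            #   broadcast((a, b) -> (a < 0.5) & (b > 0.1), A, B)
                            # so broadcast(::typeof(<), ...) is never called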

Do we have to give up on .* for sparse arrays, and do we have to give up on elementwise boolean operations producing a BitArray? Or should broadcast always generate a BitArray if it detects that the output type is Bool?

TotalVerb (Contributor):

See also ongoing discussion in #18590 (comment).

carlobaldassi (Member):

This has removed a few optimizations for BitArrays. One in particular is the case A .* B when A and B have the same shape, which previously was specialized and called A & B. The difference is quite significant, e.g. for 1000x1000 BitArrays it's almost 40-fold.

I wonder if there is a way to catch this case again and get the same performance as map. Also, now that we have a generalized dot-operator syntax, it would be particularly useful to exploit the cases when the function is pure and operate chunk-wise (also in map, which currently only recognizes a few operations). Is there a way to determine if a function is pure?
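A rough sketch of the kind of comparison described (timings are machine-dependent; run each line twice to exclude compilation):

A = rand(1000, 1000) .< 0.5; B = rand(1000, 1000) .< 0.5;   # two random 1000x1000 BitMatrix values
@time A .* B          # generic elementwise broadcast over the individual bits
@time map(&, A, B)    # BitArray-specialized map, which works 64 bits at a time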

stevengj (Member, Author) commented Dec 22, 2016

Yup, the general problem here is that broadcast(::typeof(*), ...) and similar methods aren't that useful to define, because in many cases they won't be called due to fusion. (In the particular case of A .* B, of course, one could simply call A & B.) As far as I know, we don't currently have any way to determine whether a function is pure.

Now that boolean operations are fused, however, it's not clear to me how often one does non-fused operations on bitarrays. We used to need it for things like (A .> 6) .& (B .> 0), but now this is fused into a single loop. How common are operations on large boolean arrays?

It also seems to me that there is quite a bit of unrolling that could be done to make chunk-by-chunk processing of BitArrays more efficient, e.g. for broadcast! with a BitArray result or getindex(A, ::BitArray). We should probably just write out (via metaprogramming) the unrolled loop for all 64 bits in a chunk. But that is separate from the question of chunk-wise pure bit operations.
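A minimal sketch of the chunk-wise idea (illustrative only, not the Base implementation; it relies on the fact that a BitArray stores its bits in a chunks::Vector{UInt64} field, and a real version would also have to mask the unused bits of the final chunk):

function chunk_map!(f, dest::BitArray, A::BitArray, B::BitArray)
    dc, ac, bc = dest.chunks, A.chunks, B.chunks
    @inbounds for i in eachindex(ac)
        dc[i] = f(ac[i], bc[i])    # processes 64 elements per iteration
    end
    return dest
end

# e.g. chunk_map!(&, falses(size(A)), A, B) computes A .& B chunk-by-chunk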

.>=,
.≥,
.\,
.^,
/,
//,
.//,
Member:

Is there a reason this is still here?

Contributor:

probably just missed - likewise with .>> and .<< below

Member (Author):

Yup, just missed them, sorry.

tkelman (Contributor) commented Dec 27, 2016

This appears to have broken the ability for packages to define and use .. as an operator.

stevengj (Member, Author):

Whoops, that wasn't intended.

stevengj (Member, Author):

Will have a PR to fix .. shortly.

ararslan mentioned this pull request Jan 9, 2017
tkelman referenced this pull request in JuliaLinearAlgebra/BandedMatrices.jl Feb 4, 2017
ajkeller34 referenced this pull request in PainterQubits/Unitful.jl Feb 21, 2017
tkelman referenced this pull request in SciML/DiffEqProblemLibrary.jl May 4, 2017
Sacha0 added the deprecation label May 14, 2017
Labels: broadcast (Applying a function over a collection), deprecation (This change introduces or involves a deprecation)