make "dot" operations (.+ etc) fusing broadcasts #17623
Conversation
This changes semantics and isn't a backport candidate.
@JeffBezanson, I'm noticing an odd behavior with the compiler in the REPL. Basically, the first fused function I compile is slow, but the second and subsequent ones are fast. In particular, the following

```julia
x = rand(10^7);
f(x) = x .+ 3.*x.^3 .+ 4.*x.^2
@time f(x);
@time f(x);
```

reports 40M allocations even on the second run. However, the same function is fast if I compile a different fused function first!

```julia
x = rand(10^7);
g(x) = x .+ 3.*x.^3 .- 4.*x.^2
@time g(x);
f(x) = x .+ 3.*x.^3 .+ 4.*x.^2
@time f(x);
@time f(x);
```

This reports only 8 allocations for `f`. Any idea what could cause this? (I'll try to reproduce it in the master branch, to see if it affects the 0.5 loop fusion, and file a separate issue if that is the case.)
#17759?
Ah, thanks @tkelman, that seems like the culprit.
@jrevels, it would be good to get some "vectorized operation" performance benchmarks on @nanosoldier.
```diff
@@ -865,7 +865,7 @@
      (begin
        #;(if (and (number? ex) (= ex 0))
            (error "juxtaposition with literal \"0\""))
-       `(call * ,ex ,(parse-unary s))))
+       `(call .* ,ex ,(parse-unary s))))
```
I'm not convinced this revised behaviour is a good thing. This could break things like `2x` where `x::Diagonal`, which `broadcast` will try to promote to `Array`.
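For reference, juxtaposition like `2x` normally lowers to a plain `*` call (the hunk above rewrote it to `.*`). A small sketch of the status-quo semantics, runnable on a Julia where this lowering change is not in effect:

```julia
using LinearAlgebra

# Juxtaposition parses as scalar `*`, not as a broadcast:
@assert Meta.parse("2x") == :(2 * x)

# With plain `*`, structured types like Diagonal are preserved:
D = Diagonal([1, 2, 3])
@assert 2D isa Diagonal
```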
`2 .* x` for `x::Diagonal` is also broken by this PR; why is breaking `2x` worse? (Such cases could be fixed by adding specialized `broadcast` methods, of course.)
`*` has different semantics from `.*`. I think it is a mistake to treat the latter as a superset of the former, as is being done with this implicit multiplication lowering to `.*`. Intuitively, `2x` means `2 * x`, not `2 .* x`.
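To illustrate the distinction this comment draws between `*` and `.*` (a sketch, not part of the PR):

```julia
A = [1 2; 3 4]
B = [5 6; 7 8]

@assert A * B  == [19 22; 43 50]   # `*` is the matrix product
@assert A .* B == [5 12; 21 32]    # `.*` is elementwise

# For scalars the two happen to coincide, which is the point of contention:
@assert 2 * 3 == 2 .* 3 == 6
```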
It doesn't have different semantics for multiplication by scalars...
But there is no guarantee that `2 .* x` is the same as `2 * x` for all types.
Also, this looks to be a performance disaster for the common scalar case.
```julia
julia> @code_llvm broadcast(*, 2, 2)

define %jl_value_t* @julia_broadcast_65524(%jl_value_t*, %jl_value_t**, i32) #0 {
top:
  %3 = alloca %jl_value_t**, align 8
  store volatile %jl_value_t** %1, %jl_value_t*** %3, align 8
  %4 = add i32 %2, -1
  %5 = icmp eq i32 %4, 0
  br i1 %5, label %fail, label %pass
fail:                                             ; preds = %top
  %6 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
  call void @jl_bounds_error_tuple_int(%jl_value_t** %6, i64 0, i64 1)
  unreachable
pass:                                             ; preds = %top
  %7 = icmp ugt i32 %4, 1
  br i1 %7, label %pass.2, label %fail1
fail1:                                            ; preds = %pass
  %8 = sext i32 %4 to i64
  %9 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
  call void @jl_bounds_error_tuple_int(%jl_value_t** %9, i64 %8, i64 2)
  unreachable
pass.2:                                           ; preds = %pass
  %10 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 1
  %11 = bitcast %jl_value_t** %10 to i64**
  %12 = load i64*, i64** %11, align 8
  %13 = load i64, i64* %12, align 16
  %14 = getelementptr %jl_value_t*, %jl_value_t** %1, i64 2
  %15 = bitcast %jl_value_t** %14 to i64**
  %16 = load i64*, i64** %15, align 8
  %17 = load i64, i64* %16, align 16
  %18 = mul i64 %17, %13
  %19 = call %jl_value_t* @jl_box_int64(i64 signext %18)
  ret %jl_value_t* %19
}
```
I guess I am not understanding what is gained by this change. Loop fusion can be forced explicitly with `.*` anyway. Why should the scalar case be disrupted for the convenience of the vector case?
Sure, it's not a big deal. But why is it a performance disaster for the scalar case? Shouldn't it be getting inlined to be equivalent to `2 * 2`?
Performance looks good to me:

```julia
julia> f(x, y) = broadcast(*, x, y)
f (generic function with 1 method)

julia> @code_llvm f(2,2)

define i64 @julia_f_70407(i64, i64) #0 {
top:
  %2 = mul i64 %1, %0
  ret i64 %2
}
```
Anyway, I'll revert this part of the PR, since it is controversial.
I wonder why `@code_llvm` on the `broadcast` itself is so scary.
One of the difficulties I'm having with this PR is that it makes it effectively impossible to define specialized methods for

For example, in Julia ≤ 0.5 we have specialized methods for

The problem is that as soon as you fuse the operation with another dot call, it produces a fused anonymous function and the specialized

Do we have to give up on
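A sketch of the dispatch problem described above (the methods and names here are hypothetical, for illustration): once dot calls fuse, the combined anonymous function hides the individual operators from any specialized `broadcast` method.

```julia
# Unfused: x .+ y lowers (in this PR's scheme) to broadcast(+, x, y),
# so a method like broadcast(::typeof(+), ...) could intercept it.
unfused(x, y) = broadcast(+, x, y)

# Fused: x .+ y .* z lowers to a single broadcast over an anonymous
# function; `+` and `*` no longer appear as the broadcast's first
# argument, so no broadcast(::typeof(+), ...) specialization can fire.
fused(x, y, z) = broadcast((a, b, c) -> a + b * c, x, y, z)

@assert unfused([1, 2], [3, 4]) == [4, 6]
@assert fused([1, 2], [3, 4], [5, 6]) == [16, 26]
```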
See also ongoing discussion in #18590 (comment).
This has removed a few optimizations for BitArrays. One in particular is the case

I wonder if there is a way to catch this case again and get the same performance as
Yup, the general problem here is

Now that boolean operations are fused, however, it's not clear to me how often one does non-fused operations on bitarrays. We used to need it for things like

It also seems to me that there is quite a bit of unrolling that could be done to make chunk-by-chunk processing of BitArrays more efficient, e.g. for
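As a sketch of the chunk-by-chunk processing mentioned above (relying on the internal `chunks` field of `BitArray`, an implementation detail that may change):

```julia
# Hypothetical chunk-wise AND over the packed 64-bit storage of BitArrays,
# processing 64 elements per iteration instead of one bit at a time.
function chunkwise_and(a::BitVector, b::BitVector)
    length(a) == length(b) || throw(DimensionMismatch("lengths must match"))
    c = falses(length(a))
    @inbounds for i in eachindex(a.chunks)
        c.chunks[i] = a.chunks[i] & b.chunks[i]
    end
    return c
end

@assert chunkwise_and(BitVector([true, false, true]),
                      BitVector([true, true, false])) == BitVector([true, false, false])
```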
```
.>=,
.≥,
.\,
.^,
/,
//,
.//,
```
Is there a reason this is still here?
probably just missed - likewise with `.>>` and `.<<` below
Yup, just missed them, sorry.
This appears to have broken the ability for packages to define and use
Whoops, that wasn't intended.
Will have a PR to fix
This is a ~~WIP~~ finished PR making dot operations into "fusing" broadcasts. That is, `x .⨳ y` (for any binary operator `⨳`) is transformed by the parser into `(⨳).(x,y)`, which in turn is fused with other nested "dot" calls into a single `broadcast`.

To do:

- `x .⨳ y` as a fusing "dot" function call.
- `.⨳` method definitions to `broadcast(::typeof(⨳), ...)` definitions. (Currently, these methods are silently ignored.)
- `MethodError: no method matching splice!(::Array{UInt8,1}, ::Array{Int64,1}, ::Array{UInt8,1})`.
- `[true] .* [true]` gives a stack overflow.
- `Float64 * Array{Float32} = Array{Float32}` etc. (for non-dot ops)
- `broadcast(::typeof(op), ...)` methods as possible
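A minimal sketch of the fusion semantics this PR introduces: a chain of dot operators evaluates in a single pass, equivalent to one `broadcast` over a combined anonymous function.

```julia
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

# x .+ 3 .* x.^3 .+ 4 .* y.^2 fuses into one broadcast; conceptually:
fused = broadcast((a, b) -> (a + 3 * a^3) + 4 * b^2, x, y)

@assert fused == x .+ 3 .* x.^3 .+ 4 .* y.^2
```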