Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TArray Indexing Performance #83

Merged
merged 14 commits into from
Jan 28, 2021
Merged

TArray Indexing Performance #83

merged 14 commits into from
Jan 28, 2021

Conversation

KDr2
Copy link
Member

@KDr2 KDr2 commented Jan 7, 2021

Trying to benchmark, profile, and improve the performance of TArray indexing.

print("indexing: ")
@btime $A[$x, $y] + $A[$x, $y]
@btime @rep INTENSITY $A[$x, $y]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this interpolate the variables correctly in the @btime macro? For accurate benchmarking it might be good to follow the standard suggestions and benchmark proper functions with arguments being interpolated, i.e., something like @btime f($x, $y) (instead of @btime f(x, y)).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it expands correctly, I checked the macro expand results. I changed them to the function calling form, the results didn't change.

@devmotion
Copy link
Member

I assume that performance of getindex can be improved by adding @propagate_inbounds (see https://docs.julialang.org/en/v1/devdocs/boundscheck/#Propagating-inbounds).

@KDr2
Copy link
Member Author

KDr2 commented Jan 7, 2021

I assume that performance of getindex can be improved by adding @propagate_inbounds (see https://docs.julialang.org/en/v1/devdocs/boundscheck/#Propagating-inbounds).

It decreases about 50% time usage 👍

@devmotion
Copy link
Member

I am not sure if it is useful for view as well.

@KDr2
Copy link
Member Author

KDr2 commented Jan 8, 2021

Results of the benchmarks:

Time of Indexing (ns)

Intensity Array TArray
1 2.541 424.809
2 2.541 814.048
3 2.541 1277
4 2.544 1711
5 2.543 2065
6 2.543 2551

Time of Set-Indexing (ns)

Intensity Array TArray
1 2.876 505.067
2 2.881 1029
3 2.843 1561
4 2.882 2041
5 4.225 2496
6 4.225 3037

Time of Broadcasting (ms)

Intensity Array TArray
1 2.869 6.898
2 5.971 13.976
3 8.938 21.024
4 2.882 2041
5 15.225 35.192
6 19.283 44.172

@KDr2
Copy link
Member Author

KDr2 commented Jan 12, 2021

I did some optimization for TArray{Float64, 2} indexing: move these arrays from task local storage to a field of CTask, which makes it type stable, and here's the benchmark results:

Time of Indexing (ns)

Intensity Array TArray optimized for Matrix{Float64}
1 2.541 424.809 34
2 2.541 814.048 67
3 2.541 1277 99
4 2.544 1711 124
5 2.543 2065 152
6 2.543 2551 207

Time of Set-Indexing (ns)

Intensity Array TArray optimized for Matrix{Float64}
1 2.876 505.067 146
2 2.881 1029 292
3 2.843 1561 443
4 2.882 2041 568
5 4.225 2496 719
6 4.225 3037 876

Indexing time consumption is reduced to 1/10. The caveat is that one must create a TArray{Float64,2} is a CTask, if one tries to create a TArray{Float64,2} in an ordinary task, an error will come out.

@devmotion
Copy link
Member

It feels strange and somewhat not right to hard code a special case for Matrix{Float64}. Can we instead add the type of the task-local array as a type parameter to TArray and/or add type assertions to fix the type instability? I mean, we know the type when we construct the TArray, don't we?

@KDr2
Copy link
Member Author

KDr2 commented Jan 13, 2021

It feels strange and somewhat not right to hard code a special case for Matrix{Float64}. Can we instead add the type of the task-local array as a type parameter to TArray and/or add type assertions to fix the type instability? I mean, we know the type when we construct the TArray, don't we?

I tried simply adding type assertions but it didn't bring any performance improvement:

n, d::Array{T, N} = task_local_storage(...)

Maybe I didn't use it correctly?

I discussed with Hong and will try another approach: move the underlying Array to struct TArray.

@KDr2
Copy link
Member Author

KDr2 commented Jan 13, 2021

@yebai

Moved TArray's underlying data from task local storage to TArray.data, here's the benchmark results:

Time of Indexing (ns)

Intensity Array old TArray Data in TArray
1 2.541 424.809 57
2 2.541 814.048 207
3 2.541 1277 159
4 2.544 1711 208
5 2.543 2065 268
6 2.543 2551 309

Time of Set-Indexing (ns)

Intensity Array old TArray Data in TArray
1 2.876 505.067 193
2 2.881 1029 383
3 2.843 1561 568
4 2.882 2041 777
5 4.225 2496 988
6 4.225 3037 1181

Time of Broadcasting (ms)

Intensity Array Data in TArray
1 1.139 1.160
2 2.298 2.299
3 3.485 3.483
4 4.704 4.687
5 5.779 5.832
6 7.732 7.780

@KDr2 KDr2 requested a review from devmotion January 13, 2021 11:14
Copy link
Member

@devmotion devmotion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to add more comments and explanations, also to motivate specific design choices. Sometimes it is a bit difficult to read and understand the implementation.

src/tarray.jl Outdated
orig_task :: Task
TArray{T,N}() where {T,N} = new(gensym(), current_task())
data::Dict{Task, Tuple{Int, AbstractArray{T, N}}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

data should be concretely typed if possible, other get(data, task) is not inferrable.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I used Array{T, N} at the beginning but not AbstractArray{N, T} and the indexing performance was better than it is now, but if we use a concrete array type, we can't wrap all kinds of arrays into a TArray, e.g. SubArray returned from view.

Using AbstractArray brings a little performance decrease but not that bad I thought, and it is compatible with all kinds of arrays.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had in mind something like

struct TArray{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N}
    orig::Task
    data::Dict{Task,Tuple{Int,A}}
end

Could we do this? Or are there any problems?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I will have a try tomorrow.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this works very well, see the latest commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

src/tarray.jl Outdated Show resolved Hide resolved
src/tarray.jl Outdated Show resolved Hide resolved
src/tarray.jl Show resolved Hide resolved
src/tarray.jl Outdated Show resolved Hide resolved
src/tarray.jl Outdated
_get(x) = x
function _get(x::TArray{T, N}) where {T, N}
n, d = x.data[current_task()]
return d
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't be inferred either (see above)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devmotion Just to clarify, do you mean the types of x::TArray need to be parametric / more concrete?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KDr2 _get and _get_for_write looks similiar to _local_storage. If that's true, maybe merge these functions into _get_local_storage / _set_local_storage for clarity?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify, do you mean the types of x::TArray need to be parametric / more concrete?

Yes, I mean that the type of d = last(x.data[task]) can only be inferred if data::Dict{Task,Tuple{Int,V}} is a field of TArray with concrete value type V that is e.g. a type parameter of TArray.

Just to clarify, this is not related to the type parameters T and N and corresponding where clauses in the definition of _get. They are not helpful for type inference.

src/tarray.jl Outdated Show resolved Hide resolved
src/tarray.jl Outdated Show resolved Hide resolved
src/tarray.jl Outdated Show resolved Hide resolved
src/tarray.jl Outdated Show resolved Hide resolved
@KDr2
Copy link
Member Author

KDr2 commented Jan 14, 2021

New benchmark results after commit "underlying array type as a parametric type of TArray":

Time of Indexing (ns)

Intensity Array old TArray Data in TArray
1 2.541 424.809 21
2 2.541 814.048 37
3 2.541 1277 54
4 2.544 1711 69
5 2.543 2065 85
6 2.543 2551 100

Time of Set-Indexing (ns)

Intensity Array old TArray Data in TArray
1 2.876 505.067 137
2 2.881 1029 276
3 2.843 1561 389
4 2.882 2041 550
5 4.225 2496 667
6 4.225 3037 785

Time of Broadcasting (ms)

Intensity Array Data in TArray
1 2.870 2.999
2 5.987 6.089
3 8.958 9.127
4 12.280 12.454
5 15.337 15.416
6 19.263 19.596

src/tarray.jl Outdated Show resolved Hide resolved
src/tarray.jl Outdated
@@ -259,7 +283,7 @@ Base.:*(x::AbstractArray, y::TArray) = x * _get(y) |> localize
Base.:*(x::TArray, y::AbstractArray) = _get(x) * y |> localize

# broadcast
Base.BroadcastStyle(::Type{TArray{T, N}}) where {T, N} = Broadcast.ArrayStyle{TArray}()
Base.BroadcastStyle(::Type{TArray{T, N, A}}) where {T, N, A} = Broadcast.ArrayStyle{TArray}()
Copy link
Member

@devmotion devmotion Jan 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there some potential problems with dropping the type information here? It seems it is not used anyway in the implementation of broadcasted so it should be fine, I assume?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also curious about this: if I use ::Type{TArray} or ::Type{TArray{T, N}}, the broadcasting benchmark will consume about 10x as much time as using ::Type{TArray{T, N, A}}.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it falls back to some more expensive implementation in Julia base if the type is not concrete?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But initial question was more about ArrayStyle{TArray} than ::Type{TArray{T,N,A}} though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, ArrayStyle{TArray} is just a singleton used to dispatch function calls to the propriety Broadcast.broadcasted implementation, so I think it's OK to drop the parametric types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the documentation for AbstractArrays the preferred definition is

Suggested change
Base.BroadcastStyle(::Type{TArray{T, N, A}}) where {T, N, A} = Broadcast.ArrayStyle{TArray}()
Base.BroadcastStyle(::Type{<:TArray}) = Broadcast.ArrayStyle{TArray}()

which is basically what you had above (and therefore should give similar performance).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Base.BroadcastStyle(::Type{<:TArray}) = Broadcast.ArrayStyle{TArray}()

I'd second the one from Julia documentation since it is more concise.

src/tarray.jl Outdated
@@ -259,7 +283,7 @@ Base.:*(x::AbstractArray, y::TArray) = x * _get(y) |> localize
Base.:*(x::TArray, y::AbstractArray) = _get(x) * y |> localize

# broadcast
Base.BroadcastStyle(::Type{TArray{T, N}}) where {T, N} = Broadcast.ArrayStyle{TArray}()
Base.BroadcastStyle(::Type{TArray{T, N, A}}) where {T, N, A} = Broadcast.ArrayStyle{TArray}()
Broadcast.broadcasted(::Broadcast.ArrayStyle{TArray}, f, args...) = f.(_get.(args)...) |> localize
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this method? Or, if the default fallback is not sufficient (I assume it's not), implement

Base.similar(bc::Broadcast.Broadcasted{Broadcast.ArrayStyle{TArray}}, ::Type{ElType}) where ElType

instead as suggested in the example in the documentation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this:

--- a/src/tarray.jl
+++ b/src/tarray.jl
@@ -46,6 +46,7 @@ TArray{T}(::UndefInitializer, dim::NTuple{N,Int}) where {T,N} = TArray(T, dim)
 TArray{T,N}(d::Vararg{<:Integer,N}) where {T,N} = TArray(T, d)
 TArray{T,N}(::UndefInitializer, d::Vararg{<:Integer,N}) where {T,N} = TArray{T,N}(d)
 TArray{T,N}(dim::NTuple{N,Int}) where {T,N} = TArray(T, dim)
+TArray{T, N, Array{T, N}}(::UndefInitializer, dim::NTuple{N,Int}) where {T,N} = TArray(T, dim)

 function TArray(T::Type, dim)
     N_dim = length(dim)
@@ -283,8 +284,10 @@ Base.:*(x::AbstractArray, y::TArray) = x * _get(y) |> localize
 Base.:*(x::TArray, y::AbstractArray) = _get(x) * y |> localize

 # broadcast
-Base.BroadcastStyle(::Type{TArray{T, N, A}}) where {T, N, A} = Broadcast.ArrayStyle{TArray}()
-Broadcast.broadcasted(::Broadcast.ArrayStyle{TArray}, f, args...) = f.(_get.(args)...) |> localize
+Base.BroadcastStyle(::Type{TArray{T, N, A}}) where {T, N, A} = Broadcast.ArrayStyle{TArray{T, N, A}}()
+# Broadcast.broadcasted(::Broadcast.ArrayStyle{TArray}, f, args...) = f.(_get.(args)...) |> localize
+Base.similar(bc::Broadcast.Broadcasted{Broadcast.ArrayStyle{TArray{T, N, A}}}, ::Type{T}) where {T, N, A} =
+    similar(TArray{T, N, Array{T, N}}, axes(bc))

 import LinearAlgebra
 import LinearAlgebra:  \, /, inv, det, logdet, logabsdet, norm

It became much slower (2.8 ms -> 192 ms).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, @KDr2 and @devmotion!

@KDr2 KDr2 requested a review from yebai January 15, 2021 03:06
@KDr2
Copy link
Member Author

KDr2 commented Jan 19, 2021

About the benchmarks of getindex, I checked generated LLVM IR, multiple get-indexing expressions in a function are optimized to one expression:

f(A, i)= begin
A[i]
A[i]
A[i]
A[i]
end

becomes

f(A, i) = A[i]

Because the compiler knows that if A is a built-in Array, only one expression is used (as the return value).

I will try to find a way to prevent this optimization.

Copy link
Member

@yebai yebai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @KDr2 - I've added some comments below. The PR looks good in general, however, there are a few places that can be improved for clarity/performance

src/tarray.jl Outdated Show resolved Hide resolved
src/tarray.jl Outdated Show resolved Hide resolved
src/tarray.jl Outdated Show resolved Hide resolved
src/tarray.jl Outdated Show resolved Hide resolved
src/tarray.jl Outdated Show resolved Hide resolved
src/tarray.jl Outdated
_get(x) = x
function _get(x::TArray{T, N}) where {T, N}
n, d = x.data[current_task()]
return d
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@devmotion Just to clarify, do you mean the types of x::TArray need to be parametric / more concrete?

src/tarray.jl Outdated
n, d = x.data[current_task()]
return d
end
function _get_for_write(x::TArray)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe consider unifying _get and _get_for_write, e.g. by adding a flag to force deepcopy?

src/tarray.jl Outdated
_get(x) = x
function _get(x::TArray{T, N}) where {T, N}
n, d = x.data[current_task()]
return d
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@KDr2 _get and _get_for_write looks similiar to _local_storage. If that's true, maybe merge these functions into _get_local_storage / _set_local_storage for clarity?

src/tarray.jl Outdated
@@ -259,7 +283,7 @@ Base.:*(x::AbstractArray, y::TArray) = x * _get(y) |> localize
Base.:*(x::TArray, y::AbstractArray) = _get(x) * y |> localize

# broadcast
Base.BroadcastStyle(::Type{TArray{T, N}}) where {T, N} = Broadcast.ArrayStyle{TArray}()
Base.BroadcastStyle(::Type{TArray{T, N, A}}) where {T, N, A} = Broadcast.ArrayStyle{TArray}()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Base.BroadcastStyle(::Type{<:TArray}) = Broadcast.ArrayStyle{TArray}()

I'd second the one from Julia documentation since it is more concise.

src/tarray.jl Outdated
@@ -259,7 +283,7 @@ Base.:*(x::AbstractArray, y::TArray) = x * _get(y) |> localize
Base.:*(x::TArray, y::AbstractArray) = _get(x) * y |> localize

# broadcast
Base.BroadcastStyle(::Type{TArray{T, N}}) where {T, N} = Broadcast.ArrayStyle{TArray}()
Base.BroadcastStyle(::Type{TArray{T, N, A}}) where {T, N, A} = Broadcast.ArrayStyle{TArray}()
Broadcast.broadcasted(::Broadcast.ArrayStyle{TArray}, f, args...) = f.(_get.(args)...) |> localize
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, @KDr2 and @devmotion!

@KDr2
Copy link
Member Author

KDr2 commented Jan 26, 2021

The latest commit:

  • documentation update
  • renamed function keep to register_to_keeper
  • revised local storage functions: _local_stroge, _get, _get_for_write to _set_local_storage and _get_local_storage
  • using Base.similar in implementation of broadcasting (though not performant)

@KDr2 KDr2 requested a review from yebai January 26, 2021 22:26
@devmotion
Copy link
Member

using Base.similar in implementation of broadcasting (though not performant)

Why did you change it? I would be fine with the implementation of broadcasted, I was mainly curious if and why it is needed. I only suggested to simplify the implementation of BroadcastStyle slightly to

Base.BroadcastStyle(::Type{<:TArray}) = Broadcast.ArrayStyle{TArray}()

as shown in the documentation.

Copy link
Member

@yebai yebai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - happy to merge once the broadcasting issue raised by @devmotion is addressed!

@yebai yebai changed the title [WIP] TArray Indexing Performance TArray Indexing Performance Jan 27, 2021
@KDr2
Copy link
Member Author

KDr2 commented Jan 27, 2021

using Base.similar in implementation of broadcasting (though not performant)

Why did you change it? I would be fine with the implementation of broadcasted, I was mainly curious if and why it is needed. I only suggested to simplify the implementation of BroadcastStyle slightly to

Base.BroadcastStyle(::Type{<:TArray}) = Broadcast.ArrayStyle{TArray}()

as shown in the documentation.

Sorry, I misunderstood, fixed now.

@yebai yebai merged commit c66b8c9 into master Jan 28, 2021
@delete-merged-branch delete-merged-branch bot deleted the performance branch January 28, 2021 16:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants