TArray Indexing Performance #83

KDr2 · 2021-01-07T07:07:27Z

Trying to benchmark, profile, and improve the performance of TArray indexing.

devmotion · 2021-01-07T09:36:07Z

test/benchmarks.jl

 print("indexing: ")
-@btime $A[$x, $y] + $A[$x, $y]
+@btime @rep INTENSITY $A[$x, $y]


Does this interpolate the variables correctly in the @btime macro? For accurate benchmarking it might be good to follow the standard suggestions and benchmark proper functions with arguments being interpolated, i.e., something like @btime f($x, $y) (instead of @btime f(x, y)).

I think it expands correctly, I checked the macro expand results. I changed them to the function calling form, the results didn't change.

devmotion · 2021-01-07T09:37:54Z

I assume that performance of getindex can be improved by adding @propagate_inbounds (see https://docs.julialang.org/en/v1/devdocs/boundscheck/#Propagating-inbounds).

KDr2 · 2021-01-07T11:29:48Z

I assume that performance of getindex can be improved by adding @propagate_inbounds (see https://docs.julialang.org/en/v1/devdocs/boundscheck/#Propagating-inbounds).

It decreases about 50% time usage 👍

devmotion · 2021-01-07T12:05:59Z

I am not sure if it is useful for view as well.

KDr2 · 2021-01-08T02:54:37Z

Results of the benchmarks:

Time of Indexing (ns)

Intensity	Array	TArray
1	2.541	424.809
2	2.541	814.048
3	2.541	1277
4	2.544	1711
5	2.543	2065
6	2.543	2551

Time of Set-Indexing (ns)

Intensity	Array	TArray
1	2.876	505.067
2	2.881	1029
3	2.843	1561
4	2.882	2041
5	4.225	2496
6	4.225	3037

Time of Broadcasting (ms)

Intensity	Array	TArray
1	2.869	6.898
2	5.971	13.976
3	8.938	21.024
4	2.882	2041
5	15.225	35.192
6	19.283	44.172

KDr2 · 2021-01-12T07:25:06Z

I did some optimization for TArray{Float64, 2} indexing: move these arrays from task local storage to a field of CTask, which makes it type stable, and here's the benchmark results:

Time of Indexing (ns)

Intensity	Array	TArray	optimized for Matrix{Float64}
1	2.541	424.809	34
2	2.541	814.048	67
3	2.541	1277	99
4	2.544	1711	124
5	2.543	2065	152
6	2.543	2551	207

Time of Set-Indexing (ns)

Intensity	Array	TArray	optimized for Matrix{Float64}
1	2.876	505.067	146
2	2.881	1029	292
3	2.843	1561	443
4	2.882	2041	568
5	4.225	2496	719
6	4.225	3037	876

Indexing time consumption is reduced to 1/10. The caveat is that one must create a TArray{Float64,2} is a CTask, if one tries to create a TArray{Float64,2} in an ordinary task, an error will come out.

devmotion · 2021-01-12T15:37:44Z

It feels strange and somewhat not right to hard code a special case for Matrix{Float64}. Can we instead add the type of the task-local array as a type parameter to TArray and/or add type assertions to fix the type instability? I mean, we know the type when we construct the TArray, don't we?

KDr2 · 2021-01-13T00:49:18Z

It feels strange and somewhat not right to hard code a special case for Matrix{Float64}. Can we instead add the type of the task-local array as a type parameter to TArray and/or add type assertions to fix the type instability? I mean, we know the type when we construct the TArray, don't we?

I tried simply adding type assertions but it didn't bring any performance improvement:

n, d::Array{T, N} = task_local_storage(...)

Maybe I didn't use it correctly?

I discussed with Hong and will try another approach: move the underlying Array to struct TArray.

KDr2 · 2021-01-13T10:05:20Z

@yebai

Moved TArray's underlying data from task local storage to TArray.data, here's the benchmark results:

Time of Indexing (ns)

Intensity	Array	old TArray	Data in TArray
1	2.541	424.809	57
2	2.541	814.048	207
3	2.541	1277	159
4	2.544	1711	208
5	2.543	2065	268
6	2.543	2551	309

Time of Set-Indexing (ns)

Intensity	Array	old TArray	Data in TArray
1	2.876	505.067	193
2	2.881	1029	383
3	2.843	1561	568
4	2.882	2041	777
5	4.225	2496	988
6	4.225	3037	1181

Time of Broadcasting (ms)

Intensity	Array	Data in TArray
1	1.139	1.160
2	2.298	2.299
3	3.485	3.483
4	4.704	4.687
5	5.779	5.832
6	7.732	7.780

devmotion

It would be good to add more comments and explanations, also to motivate specific design choices. Sometimes it is a bit difficult to read and understand the implementation.

devmotion · 2021-01-13T14:03:03Z

src/tarray.jl

    orig_task :: Task
-    TArray{T,N}() where {T,N} = new(gensym(), current_task())
+    data::Dict{Task, Tuple{Int, AbstractArray{T, N}}}


data should be concretely typed if possible, other get(data, task) is not inferrable.

Indeed, I used Array{T, N} at the beginning but not AbstractArray{N, T} and the indexing performance was better than it is now, but if we use a concrete array type, we can't wrap all kinds of arrays into a TArray, e.g. SubArray returned from view.

Using AbstractArray brings a little performance decrease but not that bad I thought, and it is compatible with all kinds of arrays.

I had in mind something like

struct TArray{T,N,A<:AbstractArray{T,N}} <: AbstractArray{T,N} orig::Task data::Dict{Task,Tuple{Int,A}} end

Could we do this? Or are there any problems?

Good idea, I will have a try tomorrow.

It seems this works very well, see the latest commit.

src/tarray.jl

devmotion · 2021-01-13T14:13:27Z

src/tarray.jl

+_get(x) = x
+function _get(x::TArray{T, N}) where {T, N}
+    n, d = x.data[current_task()]
+    return d


Can't be inferred either (see above)

@devmotion Just to clarify, do you mean the types of x::TArray need to be parametric / more concrete?

@KDr2 _get and _get_for_write looks similiar to _local_storage. If that's true, maybe merge these functions into _get_local_storage / _set_local_storage for clarity?

Just to clarify, do you mean the types of x::TArray need to be parametric / more concrete?

Yes, I mean that the type of d = last(x.data[task]) can only be inferred if data::Dict{Task,Tuple{Int,V}} is a field of TArray with concrete value type V that is e.g. a type parameter of TArray.

Just to clarify, this is not related to the type parameters T and N and corresponding where clauses in the definition of _get. They are not helpful for type inference.

src/tarray.jl

KDr2 · 2021-01-14T14:49:28Z

New benchmark results after commit "underlying array type as a parametric type of TArray":

Time of Indexing (ns)

Intensity	Array	old TArray	Data in TArray
1	2.541	424.809	21
2	2.541	814.048	37
3	2.541	1277	54
4	2.544	1711	69
5	2.543	2065	85
6	2.543	2551	100

Time of Set-Indexing (ns)

Intensity	Array	old TArray	Data in TArray
1	2.876	505.067	137
2	2.881	1029	276
3	2.843	1561	389
4	2.882	2041	550
5	4.225	2496	667
6	4.225	3037	785

Time of Broadcasting (ms)

Intensity	Array	Data in TArray
1	2.870	2.999
2	5.987	6.089
3	8.958	9.127
4	12.280	12.454
5	15.337	15.416
6	19.263	19.596

src/tarray.jl

devmotion · 2021-01-14T14:51:33Z

src/tarray.jl

@@ -259,7 +283,7 @@ Base.:*(x::AbstractArray, y::TArray) = x * _get(y) |> localize
 Base.:*(x::TArray, y::AbstractArray) = _get(x) * y |> localize

 # broadcast
-Base.BroadcastStyle(::Type{TArray{T, N}}) where {T, N} = Broadcast.ArrayStyle{TArray}()
+Base.BroadcastStyle(::Type{TArray{T, N, A}}) where {T, N, A} = Broadcast.ArrayStyle{TArray}()


Are there some potential problems with dropping the type information here? It seems it is not used anyway in the implementation of broadcasted so it should be fine, I assume?

I am also curious about this: if I use ::Type{TArray} or ::Type{TArray{T, N}}, the broadcasting benchmark will consume about 10x as much time as using ::Type{TArray{T, N, A}}.

Maybe it falls back to some more expensive implementation in Julia base if the type is not concrete?

But initial question was more about ArrayStyle{TArray} than ::Type{TArray{T,N,A}} though.

Oh, ArrayStyle{TArray} is just a singleton used to dispatch function calls to the propriety Broadcast.broadcasted implementation, so I think it's OK to drop the parametric types.

According to the documentation for AbstractArrays the preferred definition is

Suggested change

Base.BroadcastStyle(::Type{TArray{T, N, A}}) where {T, N, A} = Broadcast.ArrayStyle{TArray}()

Base.BroadcastStyle(::Type{<:TArray}) = Broadcast.ArrayStyle{TArray}()

which is basically what you had above (and therefore should give similar performance).

Base.BroadcastStyle(::Type{<:TArray}) = Broadcast.ArrayStyle{TArray}()

I'd second the one from Julia documentation since it is more concise.

devmotion · 2021-01-14T15:21:52Z

src/tarray.jl

@@ -259,7 +283,7 @@ Base.:*(x::AbstractArray, y::TArray) = x * _get(y) |> localize
 Base.:*(x::TArray, y::AbstractArray) = _get(x) * y |> localize

 # broadcast
-Base.BroadcastStyle(::Type{TArray{T, N}}) where {T, N} = Broadcast.ArrayStyle{TArray}()
+Base.BroadcastStyle(::Type{TArray{T, N, A}}) where {T, N, A} = Broadcast.ArrayStyle{TArray}()
 Broadcast.broadcasted(::Broadcast.ArrayStyle{TArray}, f, args...) = f.(_get.(args)...) |> localize


Can we remove this method? Or, if the default fallback is not sufficient (I assume it's not), implement

Base.similar(bc::Broadcast.Broadcasted{Broadcast.ArrayStyle{TArray}}, ::Type{ElType}) where ElType

instead as suggested in the example in the documentation?

I tried this:

--- a/src/tarray.jl +++ b/src/tarray.jl @@ -46,6 +46,7 @@ TArray{T}(::UndefInitializer, dim::NTuple{N,Int}) where {T,N} = TArray(T, dim) TArray{T,N}(d::Vararg{<:Integer,N}) where {T,N} = TArray(T, d) TArray{T,N}(::UndefInitializer, d::Vararg{<:Integer,N}) where {T,N} = TArray{T,N}(d) TArray{T,N}(dim::NTuple{N,Int}) where {T,N} = TArray(T, dim) +TArray{T, N, Array{T, N}}(::UndefInitializer, dim::NTuple{N,Int}) where {T,N} = TArray(T, dim) function TArray(T::Type, dim) N_dim = length(dim) @@ -283,8 +284,10 @@ Base.:*(x::AbstractArray, y::TArray) = x * _get(y) |> localize Base.:*(x::TArray, y::AbstractArray) = _get(x) * y |> localize # broadcast -Base.BroadcastStyle(::Type{TArray{T, N, A}}) where {T, N, A} = Broadcast.ArrayStyle{TArray}() -Broadcast.broadcasted(::Broadcast.ArrayStyle{TArray}, f, args...) = f.(_get.(args)...) |> localize +Base.BroadcastStyle(::Type{TArray{T, N, A}}) where {T, N, A} = Broadcast.ArrayStyle{TArray{T, N, A}}() +# Broadcast.broadcasted(::Broadcast.ArrayStyle{TArray}, f, args...) = f.(_get.(args)...) |> localize +Base.similar(bc::Broadcast.Broadcasted{Broadcast.ArrayStyle{TArray{T, N, A}}}, ::Type{T}) where {T, N, A} = + similar(TArray{T, N, Array{T, N}}, axes(bc)) import LinearAlgebra import LinearAlgebra: \, /, inv, det, logdet, logabsdet, norm

It became much slower (2.8 ms -> 192 ms).

thanks, @KDr2 and @devmotion!

KDr2 · 2021-01-19T15:43:10Z

About the benchmarks of getindex, I checked generated LLVM IR, multiple get-indexing expressions in a function are optimized to one expression:

f(A, i)= begin
A[i]
A[i]
A[i]
A[i]
end

becomes

f(A, i) = A[i]

Because the compiler knows that if A is a built-in Array, only one expression is used (as the return value).

I will try to find a way to prevent this optimization.

yebai

thanks @KDr2 - I've added some comments below. The PR looks good in general, however, there are a few places that can be improved for clarity/performance

src/tarray.jl

yebai · 2021-01-26T15:41:53Z

src/tarray.jl

+_get(x) = x
+function _get(x::TArray{T, N}) where {T, N}
+    n, d = x.data[current_task()]
+    return d


@devmotion Just to clarify, do you mean the types of x::TArray need to be parametric / more concrete?

yebai · 2021-01-26T15:43:19Z

src/tarray.jl

+    n, d = x.data[current_task()]
+    return d
+end
+function _get_for_write(x::TArray)


Maybe consider unifying _get and _get_for_write, e.g. by adding a flag to force deepcopy?

yebai · 2021-01-26T15:48:31Z

src/tarray.jl

+_get(x) = x
+function _get(x::TArray{T, N}) where {T, N}
+    n, d = x.data[current_task()]
+    return d


@KDr2 _get and _get_for_write looks similiar to _local_storage. If that's true, maybe merge these functions into _get_local_storage / _set_local_storage for clarity?

yebai · 2021-01-26T15:52:04Z

src/tarray.jl

@@ -259,7 +283,7 @@ Base.:*(x::AbstractArray, y::TArray) = x * _get(y) |> localize
 Base.:*(x::TArray, y::AbstractArray) = _get(x) * y |> localize

 # broadcast
-Base.BroadcastStyle(::Type{TArray{T, N}}) where {T, N} = Broadcast.ArrayStyle{TArray}()
+Base.BroadcastStyle(::Type{TArray{T, N, A}}) where {T, N, A} = Broadcast.ArrayStyle{TArray}()


Base.BroadcastStyle(::Type{<:TArray}) = Broadcast.ArrayStyle{TArray}()

I'd second the one from Julia documentation since it is more concise.

yebai · 2021-01-26T15:52:37Z

src/tarray.jl

@@ -259,7 +283,7 @@ Base.:*(x::AbstractArray, y::TArray) = x * _get(y) |> localize
 Base.:*(x::TArray, y::AbstractArray) = _get(x) * y |> localize

 # broadcast
-Base.BroadcastStyle(::Type{TArray{T, N}}) where {T, N} = Broadcast.ArrayStyle{TArray}()
+Base.BroadcastStyle(::Type{TArray{T, N, A}}) where {T, N, A} = Broadcast.ArrayStyle{TArray}()
 Broadcast.broadcasted(::Broadcast.ArrayStyle{TArray}, f, args...) = f.(_get.(args)...) |> localize


thanks, @KDr2 and @devmotion!

KDr2 · 2021-01-26T22:24:33Z

The latest commit:

documentation update
renamed function keep to register_to_keeper
revised local storage functions: _local_stroge, _get, _get_for_write to _set_local_storage and _get_local_storage
using Base.similar in implementation of broadcasting (though not performant)

devmotion · 2021-01-27T08:30:58Z

using Base.similar in implementation of broadcasting (though not performant)

Why did you change it? I would be fine with the implementation of broadcasted, I was mainly curious if and why it is needed. I only suggested to simplify the implementation of BroadcastStyle slightly to

Base.BroadcastStyle(::Type{<:TArray}) = Broadcast.ArrayStyle{TArray}()

as shown in the documentation.

yebai

Looks good - happy to merge once the broadcasting issue raised by @devmotion is addressed!

KDr2 · 2021-01-27T20:21:05Z

using Base.similar in implementation of broadcasting (though not performant)

Why did you change it? I would be fine with the implementation of broadcasted, I was mainly curious if and why it is needed. I only suggested to simplify the implementation of BroadcastStyle slightly to
Base.BroadcastStyle(::Type{<:TArray}) = Broadcast.ArrayStyle{TArray}()
as shown in the documentation.

Sorry, I misunderstood, fixed now.

KDr2 added 2 commits January 5, 2021 15:57

improve TArray benchmarks

c8af0f4

update benchmarks

fd4abd6

devmotion reviewed Jan 7, 2021

View reviewed changes

1. Base.@propagate_inbounds, 2. use function in benchmarks

ba7d533

optimize indexing for Matrix{Float64}

cf3bbaf

indent code

3f9c6e7

move TArray data into TArray

f1f9e6d

update benchmarks

694c23e

KDr2 requested a review from devmotion January 13, 2021 11:14

devmotion reviewed Jan 13, 2021

View reviewed changes

KDr2 added 3 commits January 14, 2021 00:22

update comments and fix other issues

f0a8179

more performant way to perdicate if a weakref is empty

13489c8

underlying array type as a parametric type of TArray

ea7fbc5

devmotion reviewed Jan 14, 2021

View reviewed changes

src/tarray.jl Outdated Show resolved Hide resolved

minor update

6e192db

devmotion reviewed Jan 14, 2021

View reviewed changes

KDr2 requested a review from yebai January 15, 2021 03:06

yebai requested changes Jan 26, 2021

View reviewed changes

update according to code review

aaf00d2

KDr2 requested a review from yebai January 26, 2021 22:26

yebai approved these changes Jan 27, 2021

View reviewed changes

yebai changed the title ~~[WIP] TArray Indexing Performance~~ TArray Indexing Performance Jan 27, 2021

fix broadcasting

bddfc26

Update Project.toml

ad7ee12

yebai merged commit c66b8c9 into master Jan 28, 2021

delete-merged-branch bot deleted the performance branch January 28, 2021 16:03

	Base.BroadcastStyle(::Type{TArray{T, N, A}}) where {T, N, A} = Broadcast.ArrayStyle{TArray}()
	Base.BroadcastStyle(::Type{<:TArray}) = Broadcast.ArrayStyle{TArray}()

TArray Indexing Performance #83

TArray Indexing Performance #83

Uh oh!

Conversation

KDr2 commented Jan 7, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

devmotion commented Jan 7, 2021

Uh oh!

KDr2 commented Jan 7, 2021

Uh oh!

devmotion commented Jan 7, 2021

Uh oh!

KDr2 commented Jan 8, 2021

Time of Indexing (ns)

Time of Set-Indexing (ns)

Time of Broadcasting (ms)

Uh oh!

KDr2 commented Jan 12, 2021

Uh oh!

devmotion commented Jan 12, 2021

Uh oh!

KDr2 commented Jan 13, 2021

Uh oh!

KDr2 commented Jan 13, 2021

Uh oh!

devmotion left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

KDr2 commented Jan 14, 2021

Uh oh!

Uh oh!

devmotion Jan 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

devmotion Jan 14, 2021 •

edited

Loading