more efficient and generic indexing through triangular types? #294

Open
Sacha0 opened this issue Dec 27, 2015 · 8 comments
Labels
performance Must go faster

Comments

@Sacha0
Member

Sacha0 commented Dec 27, 2015

In JuliaLang/julia#14471, eliminating indexing through triangular types significantly improved performance. Benchmarks and analysis investigating this observation follow below. The analysis led down a generic linear algebra rabbit hole related to JuliaLang/julia#8240 and #136, discussed towards the end.

tl;dr Indexing through a triangular type is substantially slower than directly indexing the underlying data structure in some cases. Behavior-preserving, likely-uncontroversial changes to relevant getindex methods close a good portion of the performance gap. Further improvements appear to require non-behavior-preserving changes which are tied to broader design issues.

Edit: This post is long. Please don't read all of it. Each section stands alone. Skip the Benchmarks section by default; consider it an appendix.

Benchmarks

See this gist file for the benchmark code, this gist file for the LowerTriangular results, and this gist file for the UnitLowerTriangular results; UpperTriangular behavior should be identical to that of LowerTriangular, and likewise for UnitUpperTriangular with UnitLowerTriangular.

The four benchmarks mimic textbook level-2 BLAS access patterns: (1) sumlt scans column-wise over the lower-triangular half of a matrix, summing the matrix's elements into a reduction variable. (2) sumall scans column-wise over the entirety of a matrix, summing the matrix's elements into a reduction variable. (3) xorlt scans column-wise over the lower-triangular half of a matrix, xor'ing the matrix's elements into a reduction variable. (4) xorall scans column-wise over the entirety of a matrix, xor'ing the matrix's elements into a reduction variable. In the sum benchmarks, matrix elements are Float64s. In the xor benchmarks, matrix elements are Int64s.
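For concreteness, the sumlt and sumall access patterns look roughly like the sketch below (illustrative only; the actual benchmark code is in the gists linked above, and the xorlt/xorall variants differ only in the reduction operation):

# Illustrative sketch of the benchmark access patterns (not the gist code):
function sumlt(A)
    s = zero(eltype(A))
    n = size(A, 2)
    for j in 1:n, i in j:n        # column-wise over the lower triangle
        s += A[i, j]
    end
    return s
end

function sumall(A)
    s = zero(eltype(A))
    m, n = size(A)
    for j in 1:n, i in 1:m        # column-wise over the whole matrix
        s += A[i, j]
    end
    return s
end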

The benchmarks that scan over the lower-triangular half of a matrix reflect the behavior of algorithms that take advantage of triangular structure, whereas the benchmarks that scan over the entire matrix reflect the behavior of algorithms that ignore that structure. The sum and xor benchmarks generally exhibit the same trends, but because xor is simpler than floating-point addition, the xor benchmarks exhibit those trends more sharply.

Two broad classes of matrix types underlying the triangular types are relevant here: (1) matrix types for which element access is relatively costly ('expensive element access' below); and (2) matrix types for which element access is relatively inexpensive ('cheap element access' below). As a model of the former the benchmarks test SparseMatrixCSCs, and as a model of the latter the benchmarks test Arrays; the costs of branching and simple operations are mostly irrelevant when scanning element-by-element over the former, whereas such costs are significant when scanning element-by-element over the latter.

For comparison, each benchmark contains a dref_ method which directly indexes the underlying data structure. Though in the sumall and xorall benchmarks this method produces incorrect results, it remains useful for comparison.

(I also benchmarked simple scans --- the equivalent of the sum and xor benchmarks without the sum or xor for each element access, only assignment to the reduction variable. Trends in those benchmarks were similar to those for the sum and xor benchmarks, just accentuated up to orders of magnitude. But scanning without ops does not seem relevant in practice --- please correct me if it is --- so I nixed those benchmarks.)

Analysis

Unit-triangular matrices

The UnitLowerTriangular case is more interesting, so let's begin there. The relevant getindex method from master is

getindex{T,S}(A::UnitLowerTriangular{T,S}, i::Integer, j::Integer) = i == j ? one(T) : (i > j ? A.data[i,j] : zero(A.data[j,i]))

This indexing method (master and tern in the benchmarks) involves two branches. Where element access is cheap, the branching kills performance; where element access is expensive, the branching is mostly negligible. To improve performance in the former case, we want to eliminate branching insofar as possible. So let's replace the ternary operators with ifelse:

ifelse_getindex{T,S}(A::UnitLowerTriangular{T,S}, i::Integer, j::Integer) = ifelse(i == j, one(T), ifelse(i > j, A.data[i,j], zero(A.data[j,i])))

This approach dramatically improves performance where element access is cheap. But it is a disaster where element access is expensive, as eager evaluation of ifelse's arguments results in two element accesses rather than one. Additionally, the reference to zero(A.data[j,i]) in principle adds a row-wise scan behind the scenes on top of the intended column-wise scan, a potential performance killer. (In the Array{Float64} test case the compiler appears intelligent enough to reduce zero(A.data[j,i]) to zero(T) without an element access, so there this issue and the impact of attempts to mitigate it are not evident.)
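As a quick illustration of that eager-evaluation behavior (not part of the benchmarks; the counting helper below is made up for this snippet):

# ifelse evaluates both value arguments regardless of the condition, so a
# single lookup through ifelse performs two element accesses.
naccess = Ref(0)
counted_get(A, i, j) = (naccess[] += 1; A[i, j])
A = rand(3, 3)
ifelse(2 > 1, counted_get(A, 2, 1), zero(counted_get(A, 1, 2)))
naccess[]   # == 2, even though only the first value is used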

There are multiple paths forward from this point. One path is to somehow distinguish the two element-access-cost cases and provide suitable methods for each case, but that path seems troublesome in practice. Another path involves broader changes to the triangular types, discussed further in the section on generic programming below. Concerning more immediate partial solutions, here are two: The first partial solution preserves behavior, the second does not.

The behavior-preserving partial solution is mixing the ternary operator and ifelse approaches to getindex:

mixro_getindex{T,S}(A::UnitLowerTriangular{T,S}, i::Integer, j::Integer) = i > j ? A.data[i,j] : ifelse(i == j, one(T), zero(A.data[j,i]))

This approach, mixro in the benchmarks, has a few merits. First, this approach requires at most one element access, restoring performance where element access is expensive. (Actually this approach is slightly faster than master in the expensive-element-access case, perhaps due to avoiding the second branch, but the difference is marginal.) Second, removing the second ternary operator recovers most, but not all, of the ifelse approach's performance where element access is cheap. (I'm guessing that is more due to better branch prediction than simply removing the second branch?) So all around this approach realizes a substantial improvement over master while preserving behavior. Should I prepare a PR?

The second (non-behavior-preserving) partial solution is replacing zero(A.data[j,i]) with zero(T) where i < j in getindex, mimicking the one(T) return where i == j. For example, augmenting the mixro approach above with this change yields mixroz in the benchmarks

mixroz_getindex{T,S}(A::UnitLowerTriangular{T,S}, i::Integer, j::Integer) = i > j ? A.data[i,j] : ifelse(i == j, one(T), zero(T))

The benchmarks contain such a zero(A.data[j,i]) -> zero(T) variant for each method. But with the second element access removed altogether, one might as well use ifelse, yielding ifelsez in the benchmarks

ifelsez_getindex{T,S}(A::UnitLowerTriangular{T,S}, i::Integer, j::Integer) = ifelse(i == j, one(T), ifelse(i > j, A.data[i,j], zero(T)))

which exhibits the best overall performance among the benchmarked methods. (Only ternz-like methods beat it marginally in the sumall and xorall benchmarks involving expensive element access: those methods avoid an unnecessary element access when scanning over the strict upper triangle. In any case, both mixroz and ternz-like approaches nicely outperform master.) The caveat: This change brings up generic linear algebra design issues; see the dedicated section below for more.

Non-unit triangular matrices

For the non-unit case, the relevant getindex method from master (master and tern in the benchmarks) is

getindex{T,S}(A::LowerTriangular{T,S}, i::Integer, j::Integer) = i >= j ? A.data[i,j] : zero(A.data[j,i])

Here an ifelse method

ifelse_getindex{T,S}(A::LowerTriangular{T,S}, i::Integer, j::Integer) = ifelse(i >= j, A.data[i,j], zero(A.data[j,i]))

again dramatically improves performance where element access is cheap, and worsens performance by roughly a factor of two where element access is expensive. The obvious solution is replacing zero(A.data[j,i]) with zero(T) as above

ifelsez_getindex{T,S}(A::LowerTriangular{T,S}, i::Integer, j::Integer) = ifelse(i >= j, A.data[i,j], zero(T))

That modification again restores performance where element access is expensive. But the caveats regarding generic linear algebra design mentioned above of course apply here as well; see the dedicated section below for more.

So unlike in the unit-triangular case, here behavior-preserving changes that close the performance gap to direct indexing of the underlying data structure are not obvious.

zero(T) and generic linear algebra design issues

Changing zero(A.data[j,i]) to zero(T) in the getindex methods for

[Unit](Lower|Upper)Triangular{T,S<:AbstractMatrix{T}}

types implicitly restricts S to subtypes of AbstractMatrix{T} for which zero(T) exists. In the case of Unit(Lower|Upper)Triangular types, this restriction already exists via one(T). This restriction leads to failures when, for example, T is almost any array type, Matrix included. To illustrate,

import Base.LinAlg.UnitLowerTriangular
A = Array(Matrix{Float64}, 3, 3)
fill!(A, eye(Float64, 2, 2))
ultA = UnitLowerTriangular(A)
ultA[1,1]

throws a MethodError: 'one' has no method matching one(::Type{Array{Float64,2}}). So given the present state, changing zero(A.data[j,i]) to zero(T) doesn't seem so bad. But the present state aside:

At first glance, solving this issue in the case of one(T) seems straightforward: generalize one to something that returns an appropriate multiplicative identity. Where T is an array type (so returning an explicit identity from one is not possible, since T does not encode matrix size), an abstract multiplicative identity object could work. But cogently defining the behavior of such an object in general may be tricky. Furthermore, performance implications may exist, for example potentially requiring expensive checks along the lines of ismultiplicativeidentityoftype(T,x) in setindex (which presently performs the comparisons x == 0 and x == 1, presenting genericity issues in their own right). Lots of thorns rapidly appear. I imagine sidestepping this issue with zero(T) was the motivation for using zero(A.data[j,i]), but the zero(A.data[j,i]) approach clearly has plenty of vices alongside its virtues.
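For illustration only, here is one rough sketch of what such an abstract multiplicative identity might look like; the MulIdentity type and its methods are made up for this snippet and deliberately ignore the thorny questions above:

# Hypothetical sketch: an abstract multiplicative identity that defers to
# whatever it is combined with, so no size information is needed.
immutable MulIdentity end
import Base: *
*(::MulIdentity, ::MulIdentity) = MulIdentity()
*(::MulIdentity, x) = x
*(x, ::MulIdentity) = x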

Abstracting and distilling the problem:

In JuliaLang/julia#8240, @jiahao makes a beautiful distinction between 'storage' and 'label' types, where label types simply annotate the contents of a wrapped storage type. The present triangular types are something else: neither storage nor merely a label, they modify the apparent contents of the wrapped storage. Making these triangular types work generically seems to require that both 1 and 0 be well defined both abstractly and concretely for all potential element types of triangular types. As above, that seems tricky at best; call this approach (1).

Two potential alternatives: (2) Explicitly restrict triangular types to element types for which one and zero are both well defined, for example Numbers, making no guarantees about the generality of triangular types. (3) Make triangular types strictly labels, and provide access methods which modify the apparent contents of the wrapped storage type only where necessary (for example with packed triangular storage, similar to the present treatment of some special matrix types.) This approach (3) appears to have advantages in simplicity, consistency, performance, and in some sense generality as well. (1) and (2) certainly have merits though, particularly generality in a different sense.

</War and Peace> Thanks for reading! Thoughts?

@tkelman

tkelman commented Dec 27, 2015

The partial solution for UnitTriangular sounds worth trying with few downsides AFAICT. It would make sense to me to do that change as a PR and point to some of the supporting benchmarks here, it sounds like you've been pretty thorough so far. Try to be more concise on this kind of thing though, you're asking for a bit of an investment from anyone to read through all the details here.

Absent traits or somewhat unusual method-exists checks upon construction, I don't see how you would go about explicitly restricting the element types for the triangular wrappers.

Do you have a concrete proposal on 3, or what do you mean by "access methods which modify the apparent contents of the wrapped storage type"?

@Sacha0
Member Author

Sacha0 commented Dec 27, 2015

The partial solution for UnitTriangular sounds worth trying with few downsides AFAICT. It would make sense to me to do that change as a PR and point to some of the supporting benchmarks here, it sounds like you've been pretty thorough so far.

Cheers. I will prepare a PR. Architecture dependence might be an issue. To facilitate broader testing I will include a stripped-down benchmark.

Try to be more concise on this kind of thing though, you're asking for a bit of an investment from anyone to read through all the details here.

Absolutely, and thanks :). Definitely a bit long, hence the </War and Peace>. I am often torn between [concise but potentially incomplete and/or opaque] and [clear and complete but long]; suggestions for how I might strike a better balance would be much appreciated. I added a note at the top making explicit my hope that readers skip much of the content included for completeness.

Absent traits or somewhat unusual method-exists checks upon construction, I don't see how you would go about explicitly restricting the element types for the triangular wrappers.

Agreed. I was thinking documentation of the implicit restriction for now, were that route taken.

Do you have a concrete proposal on 3, or what do you mean by "access methods which modify the apparent contents of the wrapped storage type"?

Clarification: Where A is UnitLowerTriangular and the existing implementation is in effect, getindex(A, i, j) returns an appropriate zero where i < j and an appropriate one where i == j, independent of the underlying data structure's contents. That behavior is what I meant by "access methods which modify the apparent contents of the wrapped storage type". This behavior is great. If it can be made to work consistently, generically, and without sacrificing too much performance, it's probably ideal.
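For instance, a small illustration of that behavior with the existing methods:

import Base.LinAlg.UnitLowerTriangular
# The wrapper overrides what the underlying storage holds on and above the diagonal.
A = fill(7.0, 3, 3)
U = UnitLowerTriangular(A)
U[1,1]   # 1.0, regardless of A[1,1]
U[1,2]   # 0.0, regardless of A[1,2]
U[2,1]   # 7.0, forwarded from A[2,1]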

But if not, removing the getindex/setindex magic from triangular types might be an advantageous compromise. By that I mean having A[i,j] = A.data[i,j] independent of i and j. Then triangular types reduce to @jiahao's 'labels' --- types which annotate the contents of storage types for the purposes of dispatch and nothing more. That behaves consistently and in some sense generically, doesn't have e.g. issues with block-matrix constructions, and avoids performance gotchas. But it is much less nice than the existing approach in some respects: less safe, more burden on the user.

Of course supporting both approaches through different types would be possible but seems suboptimal.

To be clear, I would prefer to see the existing approach work. It's really elegant. I'm just not certain whether it can be made to work consistently, generically, and without sacrificing too much performance, and if so how. But I'd love to find a way and am hoping better minds than mine will dream up solutions.

@tkelman

tkelman commented Dec 27, 2015

By that I mean having A[i,j] = A.data[i,j] independent of i and j

Yikes. The getindex "magic" as you call it is kind of the point of what the labels are labeling.

My personal hypothesis on this and most other #136-related issues is that getindex on a row,column pair is usually the wrong question to be asking for structured matrix types. It should work and be supported, but I don't think it matters so much if it's slow. If you know you have a specific structure, you can write code that iterates over the wrapped data according to that structure rather than in blind dense cartesian iteration. What's missing is a generic way of iterating over structure.
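A rough sketch of the kind of structure-aware code I mean (hypothetical; sum_structured is made up for this snippet, and no generic iteration-over-structure API exists yet):

import Base.LinAlg.LowerTriangular
# Structure-aware reduction: visit only the stored lower triangle of the
# parent, never asking getindex for entries known to be zero.
function sum_structured(L::LowerTriangular)
    A = L.data
    n = size(A, 2)
    s = zero(eltype(A))
    for j in 1:n, i in j:n
        s += A[i, j]
    end
    return s
end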

@Sacha0
Member Author

Sacha0 commented Dec 27, 2015

Yikes. The getindex "magic" as you call it is kind of the point of what the labels are labeling.

My personal hypothesis on this and most other #136-related issues is that getindex on a row,column pair is usually the wrong question to be asking for structured matrix types. It should work and be supported, but I don't think it matters so much if it's slow. If you know you have a specific structure, you can write code that iterates over the wrapped data according to that structure rather than in blind dense cartesian iteration. What's missing is a generic way of iterating over structure.

I wholeheartedly concur and, as above, prefer the existing approach. The indexing performance issue doesn't concern me much, for exactly the reasons you cite. Rather, making this

A = Matrix{Matrix{Float64}}(2,2)
A[1,1] = A[2,2] = eye(2)
A[1,2] = A[2,1] = zeros(2,2)
UnitLowerTriangular(A)[1,1]

work is what catches my attention. My thoughts above about abstract multiplicative and additive identities are all I have so far though. I look forward to more on your thoughts about efficient iteration over structured matrices :).

@Sacha0
Member Author

Sacha0 commented Dec 27, 2015

Architecture dependence might be an issue.

Updated the gists with results on a second architecture. Same trends.

@andreasnoack
Member

Thanks for the details. I was about to ask you some of this in JuliaLang/julia#14493. @tkelman has already said most of what I intended to say. In performance-critical code for XTriangular matrices you wouldn't use the general getindex methods, but would instead loop through the non-zero part, preferably with fancy iterators when we get them.

However, we should still optimize the getindex methods whenever possible as you do in JuliaLang/julia#14493, but without sacrificing generic programming by using T instead of the value. As pointed out in JuliaLang/julia#14493, this probably won't hurt performance for most data types anyway.

@Sacha0
Member Author

Sacha0 commented Dec 28, 2015

but without sacrificing generic programming by using T instead of the value

More to think about on that front: The zero(A.data[j,i]) approach does not handle block-matrix constructions containing rectangular matrices gracefully. To illustrate,

A = Matrix{Matrix{Float64}}(2,2)
A[1,1] = eye(2)
A[2,2] = eye(3)
A[1,2] = zeros(2, 3)
A[2, 1] = zeros(3, 2)
UnitLowerTriangular(A)[1,2]

returns a three-by-two zero block on master.

In fact, that test reveals a problem with JuliaLang/julia#14493. Any reference outside the strict lower triangle requires evaluation of ifelse(i == j, one(T), zero(A.data[j,i])), which eagerly evaluates one(T). But one(T) fails for block-matrix constructions, so with JuliaLang/julia#14493 getindex fails for all elements outside the strict lower triangle in block-matrix constructions, whereas master only fails on the diagonal (the nonsensical behavior highlighted above in the rectangular case aside). So in lieu of a solution to the one(T) problem, master's getindex methods might be safer than JuliaLang/julia#14493 where generic linear algebra is concerned.

Another potential approach to the zero(T)/one(T) problem: Replace one(T) with one(A.data[i,i]) and zero(A.data[j,i]) with zero(A.data[i,j]). Doing so would allow the user to resolve the ambiguity at the heart of this issue where necessary: Where one(T) and zero(T) are well defined, things would work as they do now without user intervention. But where one(T) and zero(T) are not well defined, for example in block-matrix constructions, the user could define (the necessary aspects of) the elements they want to touch and things would then work as expected. For example, if I wanted a two-by-two matrix on the diagonal of a block-matrix construction, I could define A[k,k] = Matrix{T}(2,2) and that element would then be functional within triangular types. The example just above would work, as would the example further above in #294.
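Concretely, the UnitLowerTriangular method under this approach might read something like the sketch below (proposed_getindex is just a placeholder name; it mirrors the existing method with the two substitutions above):

import Base.LinAlg.UnitLowerTriangular
# Sketch: derive identity elements from the stored entry at the queried
# position rather than from the element type T.
proposed_getindex(A::UnitLowerTriangular, i::Integer, j::Integer) =
    i > j ? A.data[i,j] : (i == j ? one(A.data[i,i]) : zero(A.data[i,j]))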

I see two potential downsides to this approach: (1) This approach would require allocating nominally-unused storage for the diagonal and zero half of the triangular object (though only if those elements need to be touched). But the approach does allow two triangular objects to pack into the same underlying square container in the traditional way, mitigating this downside. In any case this downside (sometimes 2x storage, often par for the course with triangular matrices anyway) seems negligible relative to the upside (things which should work, do). (2) Packed triangular storage formats would require special handling. But then they already do. So that doesn't seem so problematic either.

Am I missing obvious pitfalls?

@Sacha0
Member Author

Sacha0 commented Dec 31, 2015

I briefly tested the approach outlined in the preceding comment. All linalg tests pass, and the two block-matrix snippets above work immediately. I can prepare a demo PR if this seems interesting.

@KristofferC transferred this issue from JuliaLang/julia Nov 26, 2024