more efficient and generic indexing through triangular types? #294
Comments
The partial solution for UnitTriangular sounds worth trying with few downsides AFAICT. It would make sense to me to do that change as a PR and point to some of the supporting benchmarks here, it sounds like you've been pretty thorough so far. Try to be more concise on this kind of thing though, you're asking for a bit of an investment from anyone to read through all the details here. Absent traits or somewhat unusual method-exists checks upon construction, I don't see how you would go about explicitly restricting the element types for the triangular wrappers. Do you have a concrete proposal on 3, or what do you mean by "access methods which modify the apparent contents of the wrapped storage type"?
Cheers. I will prepare a PR. Architecture dependence might be an issue. To facilitate broader testing I will include a stripped-down benchmark.
Absolutely, and thanks :). Definitely a bit long, hence the
Agreed. I was thinking documentation of the implicit restriction for now, were that route taken.
Clarification: Where But if not, removing the Of course supporting both approaches through different types would be possible but seems suboptimal. To be clear, I would prefer to see the existing approach work. It's really elegant. I'm just not certain whether it can be made to work consistently, generically, and without sacrificing too much performance, and if so how. But I'd love to find a way and am hoping better minds than mine will dream up solutions.
Yikes. The getindex "magic" as you call it is kind of the point of what the labels are labeling. My personal hypothesis on this and most other #136-related issues is that
I wholeheartedly concur and as above prefer the existing approach. The indexing performance issue doesn't concern me much for exactly the reasons you cite. Rather, making this

    A = Matrix{Matrix{Float64}}(2,2)
    A[1,1] = A[2,2] = eye(2)
    A[1,2] = A[2,1] = zeros(2,2)
    UnitLowerTriangular(A)[1,1]

work is what catches my attention. My thoughts above about abstract multiplicative and additive identities are all I have so far though. I look forward to more on your thoughts about efficient iteration over structured matrices :).
Updated the gists with results on a second architecture. Same trends.
Thanks for the details. I was about to ask you some of this in JuliaLang/julia#14493. @tkelman has already said most of what I intended to say. In performance critical code for However, we should still optimize the
More to think about on that front: the snippet

    A = Matrix{Matrix{Float64}}(2,2)
    A[1,1] = eye(2)
    A[2,2] = eye(3)
    A[1,2] = zeros(2, 3)
    A[2,1] = zeros(3, 2)
    UnitLowerTriangular(A)[1,2]

returns a three-by-two zero block on master, whereas the [1,2] block is two-by-three. In fact, that test reveals a problem with JuliaLang/julia#14493. Any reference outside the strict lower triangle requires evaluation of Another potential approach to the I see two potential downsides to this approach:

(1) This approach would require allocating nominally-unused storage for the diagonal and zero half of the triangular object (though only if those elements need to be touched). But the approach does allow two triangular objects to pack into the same underlying square container in the traditional way, mitigating this downside. In any case this downside (sometimes 2x storage, often par for the course with triangular matrices anyway) seems negligible relative to the upside (things which should work, do).

(2) Packed triangular storage formats would require special handling. But then they already do. So that doesn't seem so problematic either.

Am I missing obvious pitfalls?
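The shape mismatch in that comment can be reproduced without any triangular wrapper. The sketch below is written for Julia 1.x (an assumption; the thread's snippets use the older `eye` and `Matrix{...}(2,2)` forms) and applies the `zero(A.data[j,i])` rule by hand. The variable name `mirrored_zero` is hypothetical.

```julia
# Block matrix with non-square off-diagonal blocks, mirroring the comment above.
A = Matrix{Matrix{Float64}}(undef, 2, 2)
A[1, 1] = [1.0 0.0; 0.0 1.0]                        # 2x2 identity block
A[2, 2] = [1.0 0.0 0.0; 0.0 1.0 0.0; 0.0 0.0 1.0]   # 3x3 identity block
A[1, 2] = zeros(2, 3)
A[2, 1] = zeros(3, 2)

# The zero(A.data[j,i]) rule, applied by hand at (i, j) = (1, 2):
# it builds the zero from the mirrored block, so it inherits that block's shape.
i, j = 1, 2
mirrored_zero = zero(A[j, i])

println(size(mirrored_zero))  # shape produced by the mirrored lookup
println(size(A[i, j]))        # shape the (1, 2) block actually has
```

For square blocks the two shapes coincide, which is why the earlier two-by-two example does not expose the problem.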
I briefly tested the approach outlined in the preceding comment. All linalg tests pass, and the two block-matrix snippets above work immediately. I can prepare a demo PR if this seems interesting.
In JuliaLang/julia#14471, eliminating indexing through triangular types significantly improved performance. Benchmarks and analysis investigating this observation follow below. The analysis led down a generic linear algebra rabbit hole related to JuliaLang/julia#8240 and #136, discussed towards the end.
tl;dr Indexing through a triangular type is substantially slower than directly indexing the underlying data structure in some cases. Behavior-preserving, likely-uncontroversial changes to relevant `getindex` methods close a good portion of the performance gap. Further improvements appear to require non-behavior-preserving changes which are tied to broader design issues.

Edit: This post is long. Please don't read all of it. Each section stands alone. Skip the Benchmarks section by default; consider it an appendix.
Benchmarks
See this gist file for the benchmark code, this gist file for the `LowerTriangular` results, and this gist file for the `UnitLowerTriangular` results; `UpperTriangular` behavior should be identical to that of `LowerTriangular`, and likewise for `UnitUpperTriangular` with `UnitLowerTriangular`.

The four benchmarks mimic textbook level-2 BLAS access patterns: (1) `sumlt` scans column-wise over the lower-triangular half of a matrix, summing the matrix's elements into a reduction variable. (2) `sumall` scans column-wise over the entirety of a matrix, summing the matrix's elements into a reduction variable. (3) `xorlt` scans column-wise over the lower-triangular half of a matrix, xor'ing the matrix's elements into a reduction variable. (4) `xorall` scans column-wise over the entirety of a matrix, xor'ing the matrix's elements into a reduction variable. In the sum benchmarks, matrix elements are `Float64`s. In the xor benchmarks, matrix elements are `Int64`s.

The benchmarks that scan over the lower-triangular half of a matrix reflect the behavior of algorithms that take advantage of triangular structure, whereas those benchmarks that scan over the entire matrix reflect the behavior of algorithms that ignore the triangular structure. The sum and xor benchmarks generally exhibit the same trends. But xor being simpler than floating-point addition, the xor benchmarks exhibit the trends more sharply than the sum benchmarks.
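As a rough illustration of the two access patterns, here is a minimal sketch of what `sumlt` and `sumall` might look like. The function names follow the post, but the bodies are assumptions written in Julia 1.x syntax; the actual benchmark code lives in the linked gist and may differ.

```julia
# Hypothetical re-creations of two benchmark kernels (bodies are assumptions).

# Column-wise scan over the lower-triangular half (i >= j), summing elements.
function sumlt(A::AbstractMatrix)
    s = zero(eltype(A))
    for j in 1:size(A, 2), i in j:size(A, 1)   # j is the outer loop, i the inner
        s += A[i, j]
    end
    return s
end

# Column-wise scan over the entire matrix, summing elements.
function sumall(A::AbstractMatrix)
    s = zero(eltype(A))
    for j in 1:size(A, 2), i in 1:size(A, 1)
        s += A[i, j]
    end
    return s
end

M = [1.0 2.0; 3.0 4.0]
println(sumlt(M))   # sums 1.0, 3.0, 4.0
println(sumall(M))  # sums all four entries
```

The xor variants would have the same loop structure with `xor` in place of `+` and `Int64` elements.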
Two broad classes of matrix types underlying the triangular types are relevant here: (1) matrix types for which element access is relatively costly ('expensive element access' below); and (2) matrix types for which element access is relatively inexpensive ('cheap element access' below). As a model of the former the benchmarks test `SparseMatrixCSC`s, and as a model of the latter the benchmarks test `Array`s; the costs of branching and simple operations are mostly irrelevant in scanning element-by-element over the former, whereas such costs are important in scanning element-by-element over the latter.

For comparison, each benchmark contains a `dref_` method which directly indexes the underlying data structure. Though in the `sumall` and `xorall` benchmarks this method produces incorrect results, it remains useful for comparison.

(I also benchmarked simple scans: the equivalent of the sum and xor benchmarks without the sum or xor for each element access, only assignment to the reduction variable. Trends in those benchmarks were similar to those for the sum and xor benchmarks, just accentuated up to orders of magnitude. But scanning without ops does not seem relevant in practice, so I nixed those benchmarks; please correct me if it is relevant.)
Analysis
Unit-triangular matrices
The `UnitLowerTriangular` case is more interesting, so let's begin there. The relevant `getindex` method from master (`master` and `tern` in the benchmarks) involves two branches. Where element access is cheap, the branching kills performance; where element access is expensive, the branching is mostly negligible. To improve performance in the former case, we want to eliminate branching insofar as possible. So let's replace the ternary operators with `ifelse`.

This approach dramatically improves performance where element access is cheap. But it is a disaster where element access is expensive, as eager `ifelse` argument evaluation results in two element accesses rather than one. Additionally, the reference to `zero(A.data[j,i])` hypothetically adds a row-wise scan to the intended column-wise scan behind the scenes, a potential performance killer. (It seems, though, that in the `Array{Float64}` test case the compiler is sufficiently intelligent to reduce `zero(A.data[j,i])` to `zero(T)` without element access. Hence in the `Array{Float64}` test case this performance issue and the impact of attempts to mitigate it are not evident.)

There are multiple paths forward from this point. One path is to somehow distinguish the two element-access-cost cases and provide suitable methods for each case, but that path seems troublesome in practice. Another path involves broader changes to the triangular types, discussed further in the section on generic programming below. Concerning more immediate partial solutions, here are two: the first preserves behavior, the second does not.
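To make the ternary-versus-`ifelse` contrast concrete, here is a sketch reconstructed from the prose description only (two branches, `one(T)` on the diagonal, `zero(A.data[j,i])` above it). The function names `getindex_tern` and `getindex_ifelse` are hypothetical, and the functions take the wrapped storage directly rather than a triangular wrapper; the definitions on master and in the gists may differ in detail.

```julia
# Hypothetical standalone versions of the two unit-lower-triangular lookups.
# `data` plays the role of the wrapper's A.data field.

# Nested ternaries: two explicit branches, at most one element access per call.
getindex_tern(data::AbstractMatrix{T}, i::Integer, j::Integer) where {T} =
    i == j ? one(T) : (i > j ? data[i, j] : zero(data[j, i]))

# Both ternaries replaced by ifelse: no explicit branches, but ifelse evaluates
# all of its arguments, so data[i, j] and data[j, i] are both read on every call.
getindex_ifelse(data::AbstractMatrix{T}, i::Integer, j::Integer) where {T} =
    ifelse(i == j, one(T), ifelse(i > j, data[i, j], zero(data[j, i])))

data = [2.0 7.0; 3.0 5.0]
println(getindex_tern(data, 2, 1))    # stored strict-lower entry
println(getindex_ifelse(data, 1, 2))  # structural zero above the diagonal
```

Both functions return the same values; they differ only in how many element accesses and branches each call performs, which is the trade-off the benchmarks measure.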
The behavior-preserving partial solution is mixing the ternary operator and `ifelse` approaches to `getindex`. This approach, `mixro` in the benchmarks, has a few merits. First, this approach requires at most one element access, restoring performance where element access is expensive. (Actually this approach is slightly faster than master in the expensive-element-access case, perhaps due to avoiding the second branch, but the difference is marginal.) Second, removing the second ternary operator recovers most, but not all, of the `ifelse` approach's performance where element access is cheap. (I'm guessing that is more due to better branch prediction than simply removing the second branch?) So all around this approach realizes a substantial improvement over master while preserving behavior. Should I prepare a PR?

The second (non-behavior-preserving) partial solution is replacing `zero(A.data[j,i])` with `zero(T)` where `i < j` in `getindex`, mimicking the `one(T)` return where `i == j`. For example, augmenting the `mixro` approach above with this change yields `mixroz` in the benchmarks. The benchmarks contain such a `zero(A.data[j,i]) -> zero(T)` variant for each method. But with the second element access removed altogether, one might as well use `ifelse`, yielding `ifelsez` in the benchmarks, which exhibits the best overall performance among the benchmarked methods. (Only `ternz`-like methods beat it marginally in the `sumall` and `xorall` benchmarks involving expensive element access: those methods avoid an unnecessary element access when scanning over the strict upper triangle. In any case, both `mixroz` and `ternz`-like approaches nicely outperform master.) The caveat: This change brings up generic linear algebra design issues; see the dedicated section below for more.

Non-unit triangular matrices
For the non-unit case, the relevant `getindex` method from master appears as `master` and `tern` in the benchmarks. Here an `ifelse` method again dramatically improves performance where element access is cheap, and worsens performance by roughly a factor of two where element access is expensive. The obvious solution is replacing `zero(A.data[j,i])` with `zero(T)` as above. That modification again restores performance where element access is expensive. But the caveats regarding generic linear algebra design mentioned above of course apply here as well; see the dedicated section below for more.

So unlike in the unit-triangular case, here behavior-preserving changes that close the performance gap to direct indexing of the underlying data structure are not obvious.
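The remaining variants named in the two subsections above can be sketched the same way. These are reconstructions from the prose only: the benchmark labels (`mixro`, `mixroz`, `ifelsez`) come from the post, the `lt_*` names are hypothetical, and every body is an assumption that may differ from the gist code. As before, `data` stands in for `A.data`.

```julia
# Unit lower-triangular variants (diagonal reads as one(T)).

# mixro: one ternary plus one ifelse; at most one element access per call.
mixro(data::AbstractMatrix{T}, i::Integer, j::Integer) where {T} =
    i > j ? data[i, j] : ifelse(i == j, one(T), zero(data[j, i]))

# mixroz: as mixro, but the structural zero comes from the element type.
mixroz(data::AbstractMatrix{T}, i::Integer, j::Integer) where {T} =
    i > j ? data[i, j] : ifelse(i == j, one(T), zero(T))

# ifelsez: no explicit branches; exactly one element access per call.
ifelsez(data::AbstractMatrix{T}, i::Integer, j::Integer) where {T} =
    ifelse(i > j, data[i, j], ifelse(i == j, one(T), zero(T)))

# Non-unit lower-triangular variants (diagonal is stored).

# Ternary form: one branch, one element access.
lt_tern(data::AbstractMatrix, i::Integer, j::Integer) =
    i >= j ? data[i, j] : zero(data[j, i])

# ifelse form: eager evaluation reads both data[i, j] and data[j, i].
lt_ifelse(data::AbstractMatrix, i::Integer, j::Integer) =
    ifelse(i >= j, data[i, j], zero(data[j, i]))

# ifelse with zero(T): a single element access again.
lt_ifelsez(data::AbstractMatrix{T}, i::Integer, j::Integer) where {T} =
    ifelse(i >= j, data[i, j], zero(T))

data = [2.0 7.0; 3.0 5.0]
println(mixro(data, 1, 2))    # structural zero above the diagonal
println(lt_tern(data, 1, 1))  # stored diagonal entry
```

For `Float64` elements all variants within a family agree on every index; the `zero(T)` variants change behavior only for element types where `zero(T)` and `zero(x)` differ or where `zero(T)` is undefined.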
`zero(T)` and generic linear algebra design issues

Changing `zero(A.data[j,i])` to `zero(T)` in the `getindex` methods for the triangular types implicitly restricts `S` to subtypes of `AbstractMatrix{T}` for which `zero(T)` exists. In the case of `Unit(Lower|Upper)Triangular` types, this restriction already exists with `one(T)`. This restriction leads to failures when, for example, `S` is almost any array type, `Matrix` included. To illustrate, a case with `T = Array{Float64,2}` throws a `MethodError: 'one' has no method matching one(::Type{Array{Float64,2}})`. So given this present state changing `zero(A.data[j,i])` to `zero(T)` doesn't seem so bad, but the present state aside:

At first glance, solving this issue in the case of `one(T)` seems straightforward: Generalize `one` to something that returns an appropriate multiplicative identity. Where `T` is an array type, for which returning an explicit identity from `one` is not possible given that type `T` does not encode matrix size, an abstract multiplicative identity object could work. But cogently defining the behavior of such an object generally may be tricky. Furthermore, performance implications may exist, for example potentially requiring expensive checks along the lines of `ismultiplicativeidentityoftype(T,x)` in `setindex` (which presently performs comparisons `x == 0` and `x == 1`, presenting genericity issues in their own right). Lots of thorns rapidly appear. I imagine sidestepping this issue with `zero(T)` was the motivation for using `zero(A.data[j,i])`, but the `zero(A.data[j,i])` approach clearly has plenty of vices alongside its virtues.

Abstracting and distilling the problem:

In JuliaLang/julia#8240, @jiahao makes a beautiful distinction between 'storage' and 'label' types, where 'label' types simply annotate the contents of a wrapped storage type. The present triangular types are something else: neither storage nor merely a label, they modify the apparent contents of the wrapped storage. Making these triangular types work generically seems to require that both `1` and `0` be well defined both abstractly and concretely for all potential element types of triangular types. As above, that seems tricky at best; call this approach (1).

Two potential alternatives: (2) Explicitly restrict triangular types to element types for which `one` and `zero` are both well defined, for example `Number`s, making no guarantees about the generality of triangular types. (3) Make triangular types strictly labels, and provide access methods which modify the apparent contents of the wrapped storage type only where necessary (for example with packed triangular storage, similar to the present treatment of some special matrix types). This approach (3) appears to have advantages in simplicity, consistency, performance, and in some sense generality as well. (1) and (2) certainly have merits though, particularly generality in a different sense.

</War and Peace>
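The implicit restriction discussed in this section can be checked directly. The sketch below assumes Julia 1.x and uses hypothetical variable names; it shows that `one` and `zero` applied to a matrix type raise an error (the post reports a `MethodError` for `one` on then-current master), while the same functions applied to a matrix value succeed because the value carries its size.

```julia
T = Matrix{Float64}   # element type of a block matrix

# one(T) and zero(T) need a size that the type alone does not encode.
one_of_type_fails = try
    one(T)
    false
catch
    true
end

zero_of_type_fails = try
    zero(T)
    false
catch
    true
end

# Given an actual block, the size is known and both identities exist.
block = [2.0 0.0; 0.0 2.0]
println(one(block))    # multiplicative identity of matching size
println(zero(block))   # additive identity of matching size
println((one_of_type_fails, zero_of_type_fails))
```

This is the asymmetry behind the design question: value-based lookups such as `zero(A.data[j,i])` work for block elements but cost an element access and can pick up the wrong shape, while type-based lookups such as `zero(T)` are cheap but undefined for such element types.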
Thanks for reading! Thoughts?