Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

more efficient getindex methods for unit triangular types #14493

Merged
merged 1 commit into from
Mar 7, 2016

Conversation

Sacha0
Copy link
Member

@Sacha0 Sacha0 commented Dec 27, 2015

This pull request provides behaviorally-identical but more efficient getindex methods for Unit(Lower|Upper)Triangular types. See JuliaLang/LinearAlgebra.jl#294 for details. I've benchmarked these methods on Intel Ivy Bridge and AMD Piledriver machines; benchmarking on additional architectures might be wise. The benchmarks below, a subset of those in JuliaLang/LinearAlgebra.jl#294,

import Base.LinAlg.UnitLowerTriangular
import Base.LinAlg.UnitUpperTriangular

# Define this PR's getindex methods
altgetindex{T,S}(A::UnitLowerTriangular{T,S}, i::Integer, j::Integer) = i > j ? A.data[i,j] : ifelse(i == j, one(T), zero(A.data[j,i]))
altgetindex{T,S}(A::UnitUpperTriangular{T,S}, i::Integer, j::Integer) = i < j ? A.data[i,j] : ifelse(i == j, one(T), zero(A.data[j,i]))

# Benchmark functions using master getindex
function master_sumlt(A::UnitLowerTriangular, n, s)
    @inbounds for j in 1:n, i in j:n
        s += getindex(A, i, j)
    end
    s
end
function master_sumall(A::UnitLowerTriangular, n, s)
    @inbounds for j in 1:n, i in 1:n
        s += getindex(A, i, j)
    end
    s
end
# Benchmark functions using this PR's getindex
function alt_sumlt(A::UnitLowerTriangular, n, s)
    @inbounds for j in 1:n, i in j:n
        s += altgetindex(A, i, j)
    end
    s
end
function alt_sumall(A::UnitLowerTriangular, n, s)
    @inbounds for j in 1:n, i in 1:n
        s += altgetindex(A, i, j)
    end
    s
end

# Benchmarks proper
using Benchmarks

function prettytimes(res)
    # based on Benchmarks.pretty_time_string
    stats = Benchmarks.SummaryStatistics(res)
    timecenter = stats.elapsed_time_center
    timelower = get(stats.elapsed_time_lower)
    timeupper = get(stats.elapsed_time_upper)
    timecenter < 1_000.0 ? (scalefactor = 1.0; units = "ns") :
        timecenter < 1_000_000.0 ? (scalefactor = 1_000.0; units = "μs") :
            timecenter < 1_000_000_000.0 ? (scalefactor = 1_000_000.0; units = "ms") :
                (scalefactor = 1_000_000_000.0; units = " s")
    @sprintf("%6.2f %s [%6.2f,%6.2f]", timecenter/scalefactor, units, timelower/scalefactor, timeupper/scalefactor)
end

matsize = 1000;
ultdensemat = UnitLowerTriangular(rand(matsize, matsize));
ultsparsemat = UnitLowerTriangular(sprand(matsize, matsize, 0.05));

wm, wt = 10, 25
println(" $(lpad("[branch]", wm)) $(lpad("sumlt/dense", wt)), $(lpad("sumall/dense", wt)) | $(lpad("sumlt/sparse", wt)), $(lpad("sumall/sparse", wt))")
@printf("%s: %s, %s | %s, %s\n",
    lpad("master", wm),
    prettytimes(@benchmark master_sumlt(ultdensemat, matsize, 0.0)),
    prettytimes(@benchmark master_sumall(ultdensemat, matsize, 0.0)),
    prettytimes(@benchmark master_sumlt(ultsparsemat, matsize, 0.0)),
    prettytimes(@benchmark master_sumall(ultsparsemat, matsize, 0.0)) )
@printf("%s: %s, %s | %s, %s\n",
    lpad("this PR", wm),
    prettytimes(@benchmark alt_sumlt(ultdensemat, matsize, 0.0)),
    prettytimes(@benchmark alt_sumall(ultdensemat, matsize, 0.0)),
    prettytimes(@benchmark alt_sumlt(ultsparsemat, matsize, 0.0)),
    prettytimes(@benchmark alt_sumall(ultsparsemat, matsize, 0.0)) )

yield

[branch]               sumlt/dense,              sumall/dense |              sumlt/sparse,             sumall/sparse
 master:   2.58 ms [  2.33,  2.83],   5.24 ms [  4.86,  5.61] |  12.29 ms [ 11.55, 13.04],  31.49 ms [ 30.51, 32.47]
this PR: 739.79 μs [699.44,780.14],   1.39 ms [  1.30,  1.48] |  10.64 ms [  9.99, 11.28],  28.13 ms [ 26.94, 29.33]   

on the Ivy Bridge machine, and

[branch]               sumlt/dense,              sumall/dense |              sumlt/sparse,             sumall/sparse
 master:   4.17 ms [  4.13,  4.22],   8.28 ms [  8.19,  8.36] |  15.84 ms [ 15.74, 15.93],  38.27 ms [ 38.03, 38.52]
this PR:   1.13 ms [  1.10,  1.16],   2.15 ms [  2.10,  2.20] |  13.00 ms [ 12.91, 13.08],  33.55 ms [ 33.41, 33.68]

on the Piledriver machine. Best! (Edit: xor -> sum)

@kshyatt kshyatt added performance Must go faster linear algebra Linear algebra labels Dec 27, 2015
@tkelman
Copy link
Contributor

tkelman commented Dec 28, 2015

The error on appveyor looks real, an UndefRefError in linalg/triangular? https://ci.appveyor.com/project/StefanKarpinski/julia/build/1.0.12133/job/v9kpyew4v3g1fmr0

@andreasnoack
Copy link
Member

Great. Thanks for the detailed timings. My rule of thumb is that a memory load is more expensive than the branch, but I wasn't aware that the memory load in zero(A.data[i,j]) was optimized away.

I'm not sure what is causing the error on AppVeyor.

@Sacha0
Copy link
Member Author

Sacha0 commented Dec 28, 2015

I wasn't aware that the memory load in zero(A.data[i,j]) was optimized away.

That optimization surprised me --- pretty clever!

I'm not sure what is causing the error on AppVeyor.

Neither I. Everything passes locally and the offending code (b1a2d54/test/linalg/triangular.jl#L155) looks innocuous. Is debug-play in the build environment possible, without repeatedly pushing new branches? Does MPFR behave differently on Windows?

@tkelman
Copy link
Contributor

tkelman commented Dec 28, 2015

There's a way you can access build workers remotely over rdp, see http://www.appveyor.com/docs/how-to/rdp-to-build-worker - and everything should run equivalently if you turn appveyor on for your fork, may have to temporarily add branch names to the whitelist in appveyor.yml if you do it that way though.

Does MPFR behave differently on Windows?

Possibly. C long is 32 bits on Win64, unlike most other 64 bit systems.

Also note your links to specific lines of code aren't coming out quite right - if you paste the entire url, as in

@test B == A1
, github will hide the https://github.com/JuliaLang/julia/blob/ part automatically. apparently not? strange

@Sacha0
Copy link
Member Author

Sacha0 commented Dec 28, 2015

There's a way you can access build workers remotely over rdp, see http://www.appveyor.com/docs/how-to/rdp-to-build-worker - and everything should run equivalently if you turn appveyor on for your fork, may have to temporarily add branch names to the whitelist in appveyor.yml if you do it that way though.

Great, thanks! I'll check that out if the potential issue with this PR described in JuliaLang/LinearAlgebra.jl#294 gets resolved.

andreasnoack added a commit that referenced this pull request Mar 7, 2016
more efficient getindex methods for unit triangular types
@andreasnoack andreasnoack merged commit ffa7e7b into JuliaLang:master Mar 7, 2016
@Sacha0 Sacha0 deleted the utgetindex branch June 14, 2016 18:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
linear algebra Linear algebra performance Must go faster
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants