Faster implementation of sum #5205
Conversation
This means that ranges between numbers bigger than `typemax(Int)` work now, but range lengths are still limited to `typemax(Int)`.
OS X doesn't have `sha1sum` available, so the fallback used to obtain the initial seed when /dev/urandom isn't available doesn't work.
It passed Travis's gcc compilation. The clang failure doesn't seem to be directly pertinent.
Is the correct breakpoint for unrolling an architecture-dependent thing?
Generally, it depends, to some extent, on architecture. Four-way unrolling is a very conservative unrolling that won't cause problems on most modern CPUs. I tried this trick when I was writing C++ code a couple of years ago. You can see performance improvements even with 16-way unrolling.
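For concreteness, here is a minimal Julia sketch of the four-way unrolling under discussion. The function name, element type, and cleanup loop are illustrative assumptions, not the code in this PR:

```julia
# Sketch of a 4x-unrolled sequential sum (illustrative, not the PR's code).
# Four independent accumulators break the serial dependency chain,
# letting the CPU keep several additions in flight at once.
function sum_unrolled4(a::AbstractVector{Float64})
    n = length(a)
    s1 = s2 = s3 = s4 = 0.0
    i = 1
    while i + 3 <= n
        s1 += a[i]
        s2 += a[i+1]
        s3 += a[i+2]
        s4 += a[i+3]
        i += 4
    end
    s = (s1 + s2) + (s3 + s4)
    while i <= n        # scalar cleanup for the remaining 0-3 elements
        s += a[i]
        i += 1
    end
    return s
end
```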
Cool; I was just wondering (and thinking ahead to Julia-on-ARM).
I'm curious as to which CPU model gives you a 2x speedup from unrolling. On an older Intel Core 2 Duo laptop, unrolling only gives a ~7% speedup. Unrolling 8x or 16x does not give any performance improvement over the 4x version.
I am doing this on a 3.4 GHz Intel Core i7. The effectiveness of this micro-optimization depends largely on context. However, with such conservative unrolling, it shouldn't hurt in most cases.
It's pretty crazy that it makes such a big difference over just a couple of chip generations.
I got this number by running test/benchmark_reduce.jl in NumericExtensions. I am just migrating the code there. |
If we could get LLVM's loop vectorization pass to work, it should do partial unrolling automatically (edit: link). (But until then, I endorse this solution.)
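For reference, this is roughly the shape such a loop takes once the vectorizer can be engaged from Julia. The `@simd` macro used here postdates this PR, and the function is a hypothetical illustration:

```julia
# Illustration: a reduction written so LLVM's loop vectorizer may transform it.
# @simd asserts that iterations can be reordered, which licenses the compiler
# to vectorize and partially unroll the accumulation on its own.
function sum_simd(a::AbstractVector{Float64})
    s = 0.0
    @simd for i in eachindex(a)
        @inbounds s += a[i]
    end
    return s
end
```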
@simonster That would be a more elegant solution. I think this is just a stopgap. |
I actually lean towards holding off on this PR for a while, and use the …
This might be considered a first step towards a faster sum. But frankly, I am not super excited about this. Without SIMD, the improvement is quite limited. I managed to get a sum function in a C++ library that is 10x faster than a simple C for loop (using AVX + 4x unrolling), which actually squeezes every bit of performance the CPU can deliver. Here is the core skeleton of the code: https://github.com/lindahua/light-matrix/blob/master/light_mat/mateval/internal/mat_fold_internal.h But I think LLVM auto-vectorization should be a better way in the long run.
I think it is worth putting in effort, and with compiler improvements, we can always revert to the simpler version. |
Conflicts:
	base/gmp.jl
	base/mpfr.jl
	doc/stdlib/base.rst
Looks like the rebase screwed things up. I will make a cleaner PR.
This implementation still uses pairwise summation, which recursively splits the input and sums small blocks sequentially.
The sequential kernel now uses a faster implementation, in which the loop is unrolled into four-way accumulation. This not only makes more effective use of the instruction pipeline, but also allows a larger (4x) block size while maintaining the same level of accuracy.
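A sketch of how the two pieces described above might fit together. The block-size cutoff of 1024, the function names, and the use of `view` are illustrative assumptions, reusing the hypothetical `sum_unrolled4` kernel sketched earlier:

```julia
# Sketch of pairwise summation over an unrolled base case (illustrative).
# Blocks of up to `bsiz` elements are summed sequentially with the
# 4x-unrolled kernel; larger ranges are split in half and the two
# partial sums are added, keeping rounding-error growth logarithmic in n.
function sum_pairwise(a::AbstractVector{Float64}, lo::Int, hi::Int, bsiz::Int=1024)
    if hi - lo + 1 <= bsiz
        return sum_unrolled4(view(a, lo:hi))   # hypothetical kernel from above
    else
        mid = (lo + hi) >>> 1
        return sum_pairwise(a, lo, mid, bsiz) + sum_pairwise(a, mid + 1, hi, bsiz)
    end
end

sum_pairwise(a::AbstractVector{Float64}) = sum_pairwise(a, 1, length(a))
```

The four independent accumulators in the base case behave like a small summation tree of their own, which is why the block size can grow 4x without degrading the error bound of the plain pairwise scheme.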