
Faster implementation of sum #5205

Closed · wants to merge 20 commits into from

Conversation

@lindahua (Contributor)

This implementation still uses pairwise summation, accumulating the sequential sums of small blocks.

The sequential sum uses a faster implementation in which the loop is unrolled into four-way accumulation. This not only makes more effective use of the instruction pipeline, but also allows a larger (4x) block size while achieving the same level of accuracy.
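
A minimal sketch of the scheme described above, assuming `Float64` vectors; the block size and remainder handling here are illustrative, not the PR's exact code:

```julia
# Sequential sum of a[first:last] with four independent accumulators.
# Splitting the single dependency chain into four keeps the pipeline busy.
function sum_block(a::AbstractVector{Float64}, first::Int, last::Int)
    s1 = s2 = s3 = s4 = 0.0
    i = first
    @inbounds while i + 3 <= last
        s1 += a[i]; s2 += a[i+1]; s3 += a[i+2]; s4 += a[i+3]
        i += 4
    end
    @inbounds while i <= last   # remainder of 1-3 elements
        s1 += a[i]
        i += 1
    end
    return (s1 + s2) + (s3 + s4)
end

const BLOCK = 1024   # hypothetical cutoff; the real value would be tuned

# Pairwise summation: recurse until a block is small enough, then sum it
# sequentially. The recursion keeps roundoff growth logarithmic in n.
function sum_pairwise(a::AbstractVector{Float64}, first::Int = 1,
                      last::Int = length(a))
    n = last - first + 1
    n <= BLOCK && return sum_block(a, first, last)
    mid = first + (n >> 1) - 1
    return sum_pairwise(a, first, mid) + sum_pairwise(a, mid + 1, last)
end
```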

@lindahua (Contributor, Author)

It passed Travis's gcc build. The clang failure doesn't seem directly related to this change.

@pao (Member) commented Dec 20, 2013

Is the correct breakpoint for unrolling an architecture-dependent thing?

@lindahua (Contributor, Author)

Generally, it depends to some extent on the architecture. Four-way unrolling is very conservative and shouldn't cause problems on most modern CPUs.

I tried this trick when I was writing C++ code a couple of years ago; you can see performance improvements even with 16-way unrolling.

@pao (Member) commented Dec 20, 2013

Cool; I was just wondering (and thinking ahead to Julia-on-ARM).

@jakebolewski (Member)

I'm curious which CPU model gives you a 2x speedup from unrolling. On an older Intel Core 2 Duo laptop, unrolling only gives a ~7% speedup, and unrolling 8x or 16x gives no performance improvement over the 4x version.

@lindahua (Contributor, Author)

I am doing this on a 3.4 GHz Intel Core i7.

The effectiveness of this micro-optimization depends largely on context. However, with such conservative unrolling, it shouldn't hurt in most cases.

@jakebolewski (Member)

It's pretty crazy that it makes such a big difference over just a couple of chip generations.

@lindahua (Contributor, Author)

I got this number by running test/benchmark_reduce.jl in NumericExtensions. I am just migrating the code from there.
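
For reference, a hypothetical micro-benchmark in the spirit of test/benchmark_reduce.jl; the script name is real, but everything in this snippet is illustrative (`sum_naive` is a made-up baseline, and `sum_pairwise` refers to the sketch earlier in this thread):

```julia
a = rand(10^7)

# A straightforward one-accumulator loop as the baseline.
function sum_naive(a::AbstractVector{Float64})
    s = 0.0
    @inbounds for i in eachindex(a)
        s += a[i]
    end
    return s
end

sum_naive(a); sum_pairwise(a)   # warm up the JIT before timing
@time sum_naive(a)
@time sum_pairwise(a)
```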

@simonster (Member)

If we could get LLVM's loop vectorization pass to work, it should do partial unrolling automatically (edit: link). (But until then, I endorse this solution.)
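
For context: Julia later gained an `@simd` annotation that tells LLVM it may reassociate the reduction, letting the vectorizer do this kind of partial unrolling itself. A minimal sketch, not part of this PR:

```julia
# @simd permits reordering of the floating-point accumulation, so LLVM's
# loop vectorizer can emit SIMD code with multiple accumulators on its own.
function sum_simd(a::AbstractVector{Float64})
    s = 0.0
    @inbounds @simd for i in eachindex(a)
        s += a[i]
    end
    return s
end
```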

@lindahua (Contributor, Author)

@simonster That would be a more elegant solution. I think this is just a stopgap.

@lindahua (Contributor, Author)

I actually lean towards holding off on this PR for a while and using sum (among other functions) as a testbed to see how LLVM's auto-vectorization works.

@lindahua (Contributor, Author)

This might be considered a first step towards a faster sum. But frankly, I am not super excited about this: without SIMD, the improvement is quite limited.

I managed to get a sum function in a C++ library that is 10x faster than a simple C for loop (using AVX + 4x unrolling), which actually squeezes every bit of performance the CPU can deliver.

Here is the core skeleton of the code: https://github.com/lindahua/light-matrix/blob/master/light_mat/mateval/internal/mat_fold_internal.h

But I think LLVM auto-vectorization would be the better way in the long run.

@ViralBShah (Member)

I think it is worth putting in the effort; with compiler improvements, we can always revert to the simpler version.

ViralBShah and others added 5 commits on December 26, 2013. The visible commit messages:

- A merge with conflicts in base/gmp.jl, base/mpfr.jl, and doc/stdlib/base.rst
- Revert "…oni-bigrng": this reverts commit 4cc21ca, reversing changes made to 4f9a2a9
- When passing a typevar with a concrete upper bound to inference, pass the upper bound instead; this prevents confusion when later stages believe something to be a concrete type when it's actually a typevar.
@lindahua (Contributor, Author)

Looks like the rebase screwed things up. I will make a cleaner PR.
