Faster implementation of sum #5205
Conversation
This means that ranges between numbers bigger than `typemax(Int)` work now, but range lengths are still limited to `typemax(Int)`.
OS X doesn't have `sha1sum` available, so the fallback used to obtain the initial seed when /dev/urandom isn't available doesn't work.
It passed Travis's gcc compilation. The clang failure doesn't seem to be directly pertinent.
Is the correct breakpoint for unrolling an architecture-dependent thing?
Generally, it depends, to some extent, on architecture. Four-way unrolling is a very conservative unrolling that won't cause problems on most modern CPUs. I tried this trick when I was writing C++ code a couple of years ago. You can see performance improvements even with 16-way unrolling.
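For concreteness, here is a minimal Julia sketch of the four-way unrolling under discussion. The function name, element type, and cleanup loop are illustrative assumptions, not the code in this PR:

```julia
# Sketch of a 4x-unrolled sequential sum (illustrative, not the PR's code).
# Four independent accumulators break the serial dependency chain,
# letting the CPU keep several additions in flight at once.
function sum_unrolled4(a::AbstractVector{Float64})
    n = length(a)
    s1 = s2 = s3 = s4 = 0.0
    i = 1
    while i + 3 <= n
        s1 += a[i]
        s2 += a[i+1]
        s3 += a[i+2]
        s4 += a[i+3]
        i += 4
    end
    s = (s1 + s2) + (s3 + s4)
    while i <= n        # scalar cleanup for the remaining 0-3 elements
        s += a[i]
        i += 1
    end
    return s
end
```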
Cool; I was just wondering (and thinking ahead to Julia-on-ARM).
I'm curious as to which CPU model gives you a 2x speedup from unrolling. On an older Intel Core 2 Duo laptop, unrolling only gives a ~7% speedup. Unrolling 8x or 16x does not give any performance improvement over the 4x version.
I am doing this on a 3.4 GHz Intel Core i7. The effectiveness of this micro-optimization depends largely on context. However, with such conservative unrolling, it shouldn't hurt in most cases.
It's pretty crazy that it makes such a big difference over just a couple of chip generations.
I got this number by running test/benchmark_reduce.jl in NumericExtensions. I am just migrating the code there. |
If we could get LLVM's loop vectorization pass to work, it should do partial unrolling automatically (edit: link). (But until then, I endorse this solution.)
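For reference, this is roughly the shape such a loop takes once the vectorizer can be engaged from Julia. The `@simd` macro used here postdates this PR, and the function is a hypothetical illustration:

```julia
# Illustration: a reduction written so LLVM's loop vectorizer may transform it.
# @simd asserts that iterations can be reordered, which licenses the compiler
# to vectorize and partially unroll the accumulation on its own.
function sum_simd(a::AbstractVector{Float64})
    s = 0.0
    @simd for i in eachindex(a)
        @inbounds s += a[i]
    end
    return s
end
```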
@simonster That would be a more elegant solution. I think this is just a stopgap. |
I actually lean towards holding off on this PR for a while, and use the …
This might be considered a first step towards a faster sum. But frankly, I am not super excited about this. Without SIMD, the improvement is quite limited. I managed to get a sum function in a C++ library that is 10x faster than a simple C for loop (using AVX + 4x unrolling), which actually squeezes every bit of performance the CPU can deliver. Here is the core skeleton of the code: https://github.com/lindahua/light-matrix/blob/master/light_mat/mateval/internal/mat_fold_internal.h But I think LLVM auto-vectorization should be a better way in the long run.
I think it is worth putting in effort, and with compiler improvements, we can always revert to the simpler version. |
Conflicts:
	base/gmp.jl
	base/mpfr.jl
	doc/stdlib/base.rst
Looks like the rebase screwed things up. I will make a cleaner PR.
This implementation still uses pairwise summation, which recursively splits the input and sums small blocks sequentially.
The sequential kernel now uses a faster implementation, in which the loop is unrolled into four-way accumulation. This not only makes more effective use of the instruction pipeline, but also allows a larger (4x) block size while maintaining the same level of accuracy.
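A sketch of how the two pieces described above might fit together. The block-size cutoff of 1024, the function names, and the use of `view` are illustrative assumptions, reusing the hypothetical `sum_unrolled4` kernel sketched earlier:

```julia
# Sketch of pairwise summation over an unrolled base case (illustrative).
# Blocks of up to `bsiz` elements are summed sequentially with the
# 4x-unrolled kernel; larger ranges are split in half and the two
# partial sums are added, keeping rounding-error growth logarithmic in n.
function sum_pairwise(a::AbstractVector{Float64}, lo::Int, hi::Int, bsiz::Int=1024)
    if hi - lo + 1 <= bsiz
        return sum_unrolled4(view(a, lo:hi))   # hypothetical kernel from above
    else
        mid = (lo + hi) >>> 1
        return sum_pairwise(a, lo, mid, bsiz) + sum_pairwise(a, mid + 1, hi, bsiz)
    end
end

sum_pairwise(a::AbstractVector{Float64}) = sum_pairwise(a, 1, length(a))
```

The four independent accumulators in the base case behave like a small summation tree of their own, which is why the block size can grow 4x without degrading the error bound of the plain pairwise scheme.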