attempt faster randn implementation #5105

Closed

ViralBShah opened this issue Dec 11, 2013 · 26 comments
Labels
help wanted: Indicates that a maintainer wants help on an issue or pull request
performance: Must go faster
randomness: Random number generation and the Random stdlib

Comments

@ViralBShah
Member

Currently, our randn implementation is in C:

https://github.com/JuliaLang/julia/blob/master/deps/random/randmtzig.c

For the longest time, I have wanted to have a pure julia implementation, but back then, it was not quite fast enough for something so crucial. We have now come a long way. I feel that there is a chance today that the julia randn implementation can be as fast as the C implementation.

Also, once it is a pure julia implementation, we can try different numbers of Ziggurat levels, and even SIMD types. We may be able to beat most randn implementations out there, but there is no way to find out without attempting to do so.

Our perf benchmark has a pure julia implementation, but it is not a translation of the C implementation that is used internally:

https://github.com/JuliaLang/julia/blob/master/test/perf/kernel/ziggurat.jl
https://github.com/JuliaLang/julia/blob/master/test/perf/kernel/ziggurat.c

I am opening this issue mainly to discuss possible implementation strategies for a fast pure julia randn.
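To make the discussion concrete, here is a minimal, hedged sketch of the classic Marsaglia–Tsang (2000) ziggurat for standard normals in Julia. It is not a port of randmtzig.c; the 128-level constants are the standard published ones, but the names zigset!, zignorm, kn, wn, fn are illustrative only.

```julia
using Random

# Minimal Marsaglia–Tsang (2000) ziggurat sketch: 128 layers, 32-bit draws.
# R is the rightmost layer boundary, V the common layer area (published
# 128-layer values).
const NLAYERS = 128
const R = 3.442619855899
const V = 9.91256303526217e-3

const kn = Vector{UInt32}(undef, NLAYERS)   # acceptance thresholds (scaled by 2^31)
const wn = Vector{Float64}(undef, NLAYERS)  # per-layer scale factors
const fn = Vector{Float64}(undef, NLAYERS)  # pdf values at the layer boundaries

function zigset!()
    m1 = 2147483648.0                       # 2^31, range of a 32-bit signed draw
    dn = R; tn = R
    q = V / exp(-0.5 * dn * dn)
    kn[1] = trunc(UInt32, dn / q * m1); kn[2] = 0
    wn[1] = q / m1;                     wn[NLAYERS] = dn / m1
    fn[1] = 1.0;                        fn[NLAYERS] = exp(-0.5 * dn * dn)
    for i in NLAYERS-2:-1:1                 # fill the layers from the top down
        dn = sqrt(-2.0 * log(V / dn + exp(-0.5 * dn * dn)))
        kn[i+2] = trunc(UInt32, dn / tn * m1)
        tn = dn
        fn[i+1] = exp(-0.5 * dn * dn)
        wn[i+1] = dn / m1
    end
    return nothing
end

function zignorm(rng::AbstractRNG)
    while true
        hz = rand(rng, Int32)               # one 32-bit draw per attempt
        iz = hz & 127                       # low 7 bits pick the layer
        # fast path: the draw falls inside its layer's rectangle
        abs(Int64(hz)) < kn[iz+1] && return hz * wn[iz+1]
        if iz == 0                          # base layer: sample the normal tail
            while true
                x = randexp(rng) / R
                y = randexp(rng)
                y + y > x * x && return hz > 0 ? R + x : -(R + x)
            end
        else                                # wedge: accept/reject against the pdf
            x = hz * wn[iz+1]
            fn[iz+1] + rand(rng) * (fn[iz] - fn[iz+1]) < exp(-0.5 * x * x) && return x
        end
    end
end

zigset!()
rng = MersenneTwister(1234)
samples = [zignorm(rng) for _ in 1:10^6]
```

The point of the structure is that the overwhelming majority of draws take the fast path (one integer draw, one table compare, one multiply), so the table lookups dominate; that is also what makes the SIMD question discussed further down non-trivial.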

@johnmyleswhite
Member

Getting this right would be cool. I've written some (not always correct) hash functions in Julia that might also be nice performance tests at some point.

@JeffBezanson
Member

Not necessary.

@ViralBShah
Member Author

I think this would be a fun project, even if not a necessary one. It probably does not require an open issue.

@StefanKarpinski
Member

Why not have an open up-for-grabs issue? Do we need a separate repo for issues that Jeff doesn't want to have open but that are nice to have around somewhere? Viral is fairly unlikely to forget this, but an issue is an excellent place to discuss what would be required.

@JeffBezanson
Member

I just think this is too minor to put effort into. Of course anybody is welcome to do it, but I'd like to avoid wasting time.

@ViralBShah
Member Author

I think there is potential for a much faster randn than we currently have. The effort would be worth the performance gain if julia could end up with the fastest randn of any comparable software. The reason for opening this issue was to see if others would be interested.

@StefanKarpinski Is this the kind of thing that a Hacker School student might like to try?

@ViralBShah
Member Author

We could have an open issue in Distributions.jl.

@johnmyleswhite
Member

Fine by me.

@StefanKarpinski
Member

@ViralBShah, where do you see the potential performance gain coming from? Better algorithms? Inlining?

@JeffBezanson
Member

If there is a real performance reason, this is fine. I thought it was just a rewrite.

@jiahao
Member

jiahao commented Dec 12, 2013

Does anyone have experience with mprng?

@ViralBShah
Member Author

I think there is potential to experiment with the number of ziggurat levels, using the ziggurat recursively in the tail, and autotuning for the architecture. These may or may not help. The other main thing to do is to generate multiple random numbers simultaneously using SIMD types. A translation of the C code would be a waste of time, but I am thinking of a ground-up implementation that is faster.

@JeffBezanson reopened this Dec 15, 2013
@StefanKarpinski
Member

If you want to do parallel rng, Random123 seems like the sanest approach.

@ViralBShah
Member Author

Yes, when I do attempt this, I would really like to try using Random123.

@ViralBShah added the randomness label Nov 22, 2014
@ViralBShah
Member Author

We have had a nice bump from the transition from C to Julia. Further improvements may come from better vectorization in codegen, but I am closing this for now.

@ViralBShah
Member Author

Well, and the day after, @rfourquet gave us a nice speedup (#9126). It seems like people are doing interesting things in this space, and the potential for SIMD is much better now, given some of the things we are discussing and recent updates to the compiler.

http://arxiv.org/pdf/1403.6870.pdf

I will reopen this issue.

@ViralBShah reopened this Nov 23, 2014
@ViralBShah
Member Author

Another interesting thread at Wilmott:

http://www.wilmott.com/messageview.cfm?catid=44&threadid=95982&STARTPAGE=2&FTVAR_MSGDBTABLE=

In a numpy issue:
numpy/numpy#2047

Also, a paper and code that discuss performance optimizations, including vectorized implementations:
http://www.doornik.com/research.html

@rfourquet
Member

I downloaded the source code linked in the arxiv paper you referenced; you can get the numbers on your machine via gcc -O3 -DNORMAL profile.c -lm -o normal_profiler.out. Running normal_profiler.out then shows that 10^9 normal random numbers are generated in 3.5s. The equivalent test in Julia (f(n) = (m=MersenneTwister(); x=0.0; tic(); for i in 1:n; x+=randn(m); end; toc(); x); f(1); f(10^9)) gives me 4.2s, so only about 20% slower.
Out of curiosity, I ported this code to Julia, and the same test gives 5.1s. I may have made errors, but in any case it's not clear that it could beat the current Julia version.
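For readability, here is the same inline benchmark written out as a block. It is the Julia code from the comment above, unchanged in substance (tic()/toc() and MersenneTwister living in Base are of that era; today one would use the Random stdlib plus @time or BenchmarkTools):

```julia
# rfourquet's timing loop, reformatted; the running sum keeps the loop
# from being optimized away.
function f(n)
    m = MersenneTwister()
    x = 0.0
    tic()
    for i in 1:n
        x += randn(m)
    end
    toc()
    return x
end

f(1)      # warm-up / compilation
f(10^9)   # ~4.2 s reported above, vs ~3.5 s for the C code from the paper
```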

@ViralBShah
Member Author

randn is now only about 2.5x slower than rand. Vectorization seems difficult, as @ArchRobison explained to me:

The problematic part is going to be the loads wi[idx+1] and ki[idx+1], which have random access patterns. There's an instruction in Haswell (VGATHER) to deal with this sort of thing, and icc knows how to use it. But the story for LLVM is unclear:

· LLVM 3.3 and later have target-specific intrinsics for VGATHER. E.g. there is an intrinsic “@llvm.x86.avx2.gather.d.pd”.

· Clang 3.5 (LLVM 3.5) doesn't seem to know how to generate VGATHER from vanilla LLVM IR. (I checked by handing Clang a no-brainer example.)

· But ISPC is definitely exploiting VGATHER. I think it may be generating calls to the intrinsic directly instead of depending on the LLVM optimizer.

I think the answer to your question is that @simd will not currently vectorize it because of LLVM limitations, but these limitations might be fixed in the future.
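To illustrate the access pattern being described, here is a hedged sketch of just the branch-free fast path of a ziggurat kernel; the function name, argument names, and the assumption of 256-entry ki/wi tables are mine, not taken from randmtzig.c. The data-dependent indices into ki and wi are the loads that would need a VGATHER to vectorize:

```julia
# Branch-free ziggurat fast path only (an illustration, not a correct randn):
# every element indexes the tables with its own random layer, so the loads
# ki[idx + 1] and wi[idx + 1] are gathers from the vectorizer's point of view.
# Assumes length(ki) == length(wi) == 256.
function fastpath!(out::Vector{Float64}, r::Vector{Int64},
                   ki::Vector{UInt64}, wi::Vector{Float64})
    @inbounds @simd for j in eachindex(out, r)
        idx = r[j] & 0xFF                    # random table index per element
        inside = abs(r[j]) < ki[idx + 1]     # rectangle test
        out[j] = ifelse(inside, r[j] * wi[idx + 1], 0.0)
    end
    return out
end
```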

@damiendr
Contributor

Just for the record, since parallel RNGs have been mentioned above...

I've been playing a bit with http://www.pcg-random.org/
It's only a couple of lines of code, offers parallel streams, and claims to be both fast and statistically good. So that's another contender in addition to Random123.

Now, as for @simd compatibility, that is not so certain. I've been using it with ISPC, and the compiler emits a performance warning about the variable right shift.

In addition, I have found it very hard to get anything involving unsigned ints to vectorize with @simd (I couldn't convince Julia to do unchecked truncations from UInt64 to UInt32, for instance).
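For reference, here is a hedged Julia transcription of the minimal PCG32 generator (a 64-bit LCG state advance followed by the XSH-RR output permutation), modeled on the pcg_basic.c code from pcg-random.org; the struct and function names are mine. Note the % UInt32 truncations, which is exactly the unchecked conversion that comes up next:

```julia
# Minimal PCG32 (XSH-RR) sketch: one 64-bit LCG state per stream; the odd
# increment `inc` selects the stream, which is where the parallel streams
# come from. The multiplier is the published PCG32 constant.
mutable struct PCG32
    state::UInt64
    inc::UInt64
end

const PCG32_MULT = 0x5851f42d4c957f2d   # 6364136223846793005

function pcg32_next!(rng::PCG32)
    old = rng.state
    rng.state = old * PCG32_MULT + rng.inc              # LCG advance (wraps mod 2^64)
    xorshifted = (((old >> 18) ⊻ old) >> 27) % UInt32   # xorshift high bits, keep low 32
    rot = (old >> 59) % UInt32                          # top 5 bits pick the rotation
    return (xorshifted >> rot) | (xorshifted << ((32 - rot) & 31))
end

function pcg32_seed(seed::UInt64, seq::UInt64)
    rng = PCG32(UInt64(0), (seq << 1) | 0x01)           # inc must be odd
    pcg32_next!(rng)
    rng.state += seed
    pcg32_next!(rng)
    return rng
end

rng = pcg32_seed(UInt64(42), UInt64(54))
u = pcg32_next!(rng)        # one UInt32 sample
```

The variable rotation at the end is presumably the "variable right shift" that ISPC warns about.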

@yuyichao
Contributor

try uint64_value % UInt32
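For example (this is standard Julia behavior: rem-as-conversion keeps the low bits and never throws):

```julia
x = 0x1234_5678_9abc_def0    # a UInt64
y = x % UInt32               # == 0x9abcdef0: unchecked truncation to the low 32 bits
# UInt32(x)                  # the checked conversion, by contrast, throws InexactError
```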

@ArchRobison
Contributor

Is % UInt32 documented in the manual? Indeed, I had forgotten it and was looking for that functionality over the weekend. Do we have a manual section about numeric conversions? The subject seems to deserve its own treatment, since Julia has both checked and unchecked conversions. It should also be cross-referenced with floor/ceil/round, since those do numeric conversion as well.

@StefanKarpinski
Member

I believe that @ViralBShah experimented with Random123 and found the performance to be worse than hoped.

@yuyichao
Contributor

Not sure about a conversion section, but last time I checked, this method was not documented.

@ViralBShah
Member Author

I think it was @andreasnoack

@andreasnoack
Member

I made a Julia implementation a long time ago but couldn't get anywhere near the speed they advertised. I should try with Cxx.jl and see how fast it really is.
