Improve prime sieve performance #12025

pabloferz · 2015-07-06T06:03:02Z

This PR changes the algorithm used in primesmask from a Sieve of Atkin to a Sieve of Eratosthenes, based on the ~~idea~~ redesign proposed by @VicDrastik in #11594.

The main ideas used are

~~Most integers are composite, so it is better to set all flags to false initially, then invert for possible primes~~
~~Except for 2, 3 all primes are of the from 6k ± 1~~
Implement a 30 wheel sieve. This sieve stores a flag for each number of the form 30k + p where p is one of [7, 11, 13, 17, 19, 23, 29, 31] (all primes greater than 5 are of this form).
Cross-off multiples of p of the above form, starting from p².
Use the wheel sieve _primesmask (not exported) to construct primesmask and to find primes.

Additionally, the function primesmask is now exported and both this function and primes are extended in order to generate all the primes between two integers.

As a follow-up, it would be worth trying to write a general segmented version (perhaps a mix between this and the ideas of @ironman353 in #11594 or this http://primesieve.org/segmented_sieve.html).

I apologize to @Ismael-VC who was also working on this in #11927 (his approach was to fix the current implementation of the Sieve of Atkin, but this should be even faster as suggested in #11594).

Finally, this PR addresses the initial issue on #11594 but that turned out to be a discussion on how to improve the whole support for primes. Maybe that one should be separated into smaller issues and grouped into a master issue.

CC @danaj @aiorla @StefanKarpinski

pabloferz · 2015-07-06T06:09:21Z

NOTE: This still needs some tests, news and documentation.

aiorla · 2015-07-06T08:43:59Z

Nice addition of the mask with a low and high limits. I'm curious about the time/memory performance, to compare it with the Atkin versions or the Eratosthenes one that posted @ironman353 in #11594.

Ismael-VC · 2015-07-06T14:05:18Z

This is great! @pabloferz no apologies needed, I'm still learning and this is what I wanted to see in order to learn more, thanks!

pabloferz · 2015-07-06T17:06:21Z

Here are some timings. On my machine (the best out of ten):

# Current julia implementation
julia> @time Base.primesmask(10_000_000);
 129.976 milliseconds (884 k allocations: 15047 KB)

# @VicDrastik's e5 from #11594
julia> @time e5(10_000_000);
  94.491 milliseconds (10378 k allocations: 160 MB, 6.37% gc time)

# @ironman353's version with counting removed to make a fair comparison
julia> @time Eratosthenes_1(10_000_000);
  88.570 milliseconds (9 allocations: 1221 KB)

# @Ismael-VC's proposal
julia> @time atkin(10_000_000);
  65.851 milliseconds (9 allocations: 1221 KB)

# e5, but using loops
julia> @time e5looped(10_000_000);
  45.597 milliseconds (9 allocations: 1221 KB)

# This PR
julia> @time primesmask(10_000_000);
  37.705 milliseconds (10 allocations: 1221 KB)

julia> @time primesmask(1, 10_000_000);
  36.457 milliseconds (42 allocations: 1226 KB)

julia> gc(); @time primesmask(100_000_000, 110_000_000);
  42.236 milliseconds (62 allocations: 1234 KB)

julia> gc(); @time primesmask(1_000_000_000, 1_010_000_000);
  46.094 milliseconds (990 allocations: 1268 KB)

For large numbers this won't be faster than a well implemented segment sieve, for instance @ironman353's full_sieve will be better for larger number of primes. But a proper segmented version can be added later.

ironman353 · 2015-07-07T05:00:27Z

@pabloferz, for N <= 10^6 you should use Eratosthenes_1 in my code. If N > 10^6 then full_sieve().
Here is an optimized code for sieve of Atkin and the corresponding results:
Pi(10^6) = 78498 [0.076584712 seconds (5848772 bytes allocated)]
Pi(10^7) = 664579 [0.170029603 seconds (15975876 bytes allocated)]
Pi(10^8) = 5761455 [1.197614697 seconds (117223764 bytes allocated, 2.15% gc time)]
Pi(10^9) = 50847534 [14.438797381 seconds (1129725876 bytes allocated, 0.92% gc time)]

function Atkin_3(MAX::Int)
  pi::Int64 = 0;
  SQRT_MAX::Int = floor(sqrt(MAX)) + 1
  isPrime::Array{Bool} = falses(MAX)
  index::Int = 0
  k1::Int64 = 0
  k::Int64 = 0
  xUpper::Float64 = sqrt(MAX / 4) + 1;
  x::Int64 = 1
  y::Int64 = 0
  while x < xUpper
    index = 0
    k1 = 4 * x * x
    y = 1
    if x % 3 == 0
      while true
        k = k1 + y * y
        if k >= MAX
          break
        end
        isPrime[k] = !isPrime[k]
        index += 1
        if index % 2 == 1
          y += 4
        else
          y += 2
        end
      end
    else
      while true
        k = k1 + y * y
        if k >= MAX
          break
        end
        isPrime[k] = !isPrime[k]
        y += 2
      end
    end
    x += 1
  end
  xUpper = sqrt(MAX / 3) + 1
  x = 1
  y = 0
  while x < xUpper
    index = 1
    k1 = 3 * x * x
    y = 2
    while true
      k = k1 + y * y
      if k >= MAX
        break
      end
      isPrime[k] = !isPrime[k]
      index += 1
      if index % 2 == 1
        y += 4
      else
        y += 2
      end
    end
    x += 2
  end
  xUpper = sqrt(MAX) + 1
  x = 1
  y = 0
  while x < xUpper
    k1 = 3 * x * x
    if x % 2 == 0
      y = 1
      index = 0
    else
      y = 2
      index = 1
    end
    while y < x
      k = k1 - y * y
      if k < MAX
        isPrime[k] = !isPrime[k]
      end
      index += 1
      if index % 2 == 1
        y += 4
      else
        y += 2
      end
    end
    x += 1
  end
  isPrime[2] = true
  isPrime[3] = true
  n2::Int = 0
  for n::Int = 5 : SQRT_MAX
    if isPrime[n]
      n2 = n * n
      for j::Int = n2 : n2 : MAX - 1
        isPrime[j] = false
      end
    end
  end
  for i::Int = 2 : MAX - 1
    if isPrime[i]
      pi = pi + 1
    end
  end
  return pi
end

@time Atkin_3(1000000000);

# include("E:/Julia/Atkin_3.jl")

pabloferz · 2015-07-07T05:07:54Z

@ironman353 I'm trying to compare non-segmented sieves. Of course a properly segmented version is going to be faster, but I'm not trying to provide that in this PR (at least not yet).

ironman353 · 2015-07-07T11:06:30Z

There is a O(n) sieve of Eratosthenes, but it requires extra memory, so reaching 10^9 is difficult. My java implementation takes just 1.712 seconds to generate all primes below 10^8 (single threading) which is way better than the O(n log log n) sieve of Eratosthenes or different versions of sieve of Atkin.
Here is my java code. I could not convert it into julia. I don't know if it has some data-type similar to ArrayList in java or Vector in c++.
Pi(10^7) = 664579 in 0.423 seconds
Pi(10^8) = 5761455 in 1.712 seconds

import java.io.*;
import java.util.*;
import java.math.*;

public class linear_sieve {
    public static int MAX = 100000000;
    public static int SQRT_MAX = (int)Math.floor(Math.sqrt((double)MAX));
    public static int[] x = new int[MAX];
    public static ArrayList<Integer> primes = new ArrayList<Integer>();
    public static void main(String[] args) {
        double start = System.currentTimeMillis();
        long pi = 0;
        for (int i = 2; i < MAX; i++) {
            if (x[i] == 0) {
                x[i] = i;
                primes.add(i);
            }
            for (int j = 0; j < primes.size() && primes.get(j) <= x[i] && i * primes.get(j) < MAX; j++) {
                x[i * primes.get(j)] =  primes.get(j);
            }
        }
        System.out.println("C(" + MAX + ") = " + primes.size());
        double end = System.currentTimeMillis();
        System.out.println("Time elapsed : " + (end - start) / 1000d + " seconds");
    }
}

mikewl · 2015-07-07T14:16:47Z

@ironman353 The Julia Array Type is the same as vector and ArrayList.

pabloferz · 2015-07-11T14:50:31Z

@ironman353 That optimized sieve of Atkin is really nice and is faster than what I had. Did you came up with it or did you get the algorithm from somewhere else? If it's yours, would you mind if I rewrite the PR based on your code? If it's not, do you know if we can released it under the MIT license?

ironman353 · 2015-07-13T03:31:31Z

@pabloferz, I found the algorithm at http://compoasso.free.fr/primelistweb/page/prime/atkin_en.php
I don't know if you can release it under the MIT license. You can contact them via email (at the bootom of the page http://compoasso.free.fr/primelistweb/page/prime/accueil_en.php).
There are better implementations available at that page. As I am new to Julia, I was unable to implement in Julia.

pabloferz · 2015-07-27T16:38:01Z

How can I find what conflicts this PR entails? Would a rebase fix that?

@StefanKarpinski

KristofferC · 2015-07-27T17:26:13Z

Rebase on master should fix yes.

JeffBezanson · 2015-07-27T18:36:28Z

base/exports.jl

@@ -407,6 +407,7 @@ export
    prevpow2,
    prevprod,
    primes,
+    primesmask,


I'm not sure we want to add this function, but if we do it needs documentation.

I was just following the comments in this discussion #11594 (comment)

(Also, I'm aware this would need documentation, I just haven't added it yet)

pabloferz · 2015-07-29T19:00:06Z

I have a segmented version of the function primes, but for it to perform the best it would have to be cache-size dependent. But, I do not know how to get the cache size from within Julia or whether or not such kind of implementation is desirable. Besides that can be done in another PR.

If the proposed changes seem good, I can add the corresponding documentation so this can be merged.

StefanKarpinski · 2015-07-29T19:18:08Z

It would be quite nice to provide an API for finding out things like cache sizes.

pabloferz · 2015-07-29T19:48:19Z

👍 to such an API

On the subject of using a BitArray for the primesmask function, here are some timings to serve as reference:

# Current julia implementation

julia> @time Base.primesmask(100_000_000);
   3.046 seconds      (9042 k allocations: 150 MB, 0.25% gc time)

julia> gc(); @time Base.primesmask(1_000_000_000);
  39.573 seconds      (90943 k allocations: 1507 MB, 0.24% gc time)

julia> gc(); @time Base.primes(100_000_000);
   3.348 seconds      (9042 k allocations: 194 MB, 0.26% gc time)

julia> gc(); @time Base.primes(1_000_000_000);
  41.220 seconds      (90943 k allocations: 1895 MB, 0.23% gc time)

# This PR

julia> gc(); @time primesmask(100_000_000);
 327.903 milliseconds (8 allocations: 121 MB, 0.26% gc time)

julia> gc(); @time primesmask(1_000_000_000);
   3.805 seconds      (8 allocations: 1208 MB, 0.04% gc time)

julia> gc(); @time primes(100_000_000);
 405.530 milliseconds (9 allocations: 91578 KB, 0.13% gc time)

julia> gc(); @time primes(1_000_000_000);
   4.520 seconds      (9 allocations: 701 MB, 0.03% gc time)

# Using a BitArray

julia> gc(); @time primesmask(100_000_000);
 454.866 milliseconds (11 allocations: 38249 KB, 0.11% gc time)

julia> gc(); @time primesmask(1_000_000_000);
   5.101 seconds      (11 allocations: 374 MB, 0.02% gc time)

StefanKarpinski · 2015-07-29T19:56:47Z

I was looking at this for a good 30 seconds trying to figure out what the %#!@ was going on before I noticed the different units. I really hate that we don't print times in a standard unit anymore.

KristofferC · 2015-07-29T20:03:40Z

Amen to that. I keep getting annoyed having to mentally parse the units. Before you just compared two numbers.

mbauman · 2015-07-29T20:14:33Z

There's still some known performance snags for BitArrays involving GC frames, pointer lookup hoists, and user-defined @inbounds (#11403, #11430, #7799). I'm convinced we can eventually get them more competitive in more situations.

In this case, since BitArrays are within 15%, I'd try rewriting your @inbounds indexing with unsafe_getindex and unsafe_setindex!. That might be enough to push BitArrays over the top. That said, a 10x gain is very impressive! Don't let me hold you up!

pabloferz · 2015-07-29T22:27:29Z

With unsafe_setindex! (the implementation does not uses getindex) the timings are these

julia> gc(); @time bitprimesmask(100_000_000);
 389.984 milliseconds (11 allocations: 38249 KB, 0.06% gc time)

julia> gc(); @time bitprimesmask(1_000_000_000);
   4.412 seconds      (11 allocations: 374 MB, 0.02% gc time)

This seems like a good compromise between speed and space. If there are plans on improve BitArrays performance, then there is another reason to use them for the primesmask function.

tkelman · 2015-07-31T06:45:19Z

Best to squash out any intermediate commits that would have failed tests.

pabloferz · 2015-08-01T13:16:49Z

I believe this PR is ready, so I leave it for you to review it.

Just as a remark, the use of a BitArray in the primesmask function does not affect the performance of primes.

pabloferz · 2015-08-02T02:00:26Z

Also, this would close #11594, but a lot of the discussion there would have to be moved into at least another issue.

pabloferz · 2015-08-03T07:10:16Z

If I got #11943 right, this must be now using the new doc system and, hopefully, ready.

tkelman · 2015-08-03T15:07:54Z

I think so. You can run ./julia doc/genstdlib.jl and see if the rst docs come out the way you'd like.

pabloferz · 2015-08-04T16:47:03Z

This PR, in my opinion, is ready to be merged, unless we need to wait for #12435 to land first.

@tkelman @StefanKarpinski

tkelman · 2015-08-07T11:21:49Z

Is primesmask ending up in the generated RST docs? If not, I think you need to add a signature for it so doc/genstdlib.jl can fill in the docstring.

pabloferz · 2015-08-07T14:25:00Z

@tkelman No, it doesn't show up in the RST docs. Where do I put the signature?

tkelman · 2015-08-07T14:25:54Z

Where in the manual do you want this function to be documented?

pabloferz · 2015-08-08T00:18:13Z

Ok, I tested it and it now generates proper RST docs.

tkelman · 2015-08-08T02:25:36Z

@StefanKarpinski I'm going to assume you're cool with merging this?

Improve prime sieve performance

pabloferz · 2015-08-10T16:10:37Z

#11927 should be closed and the title of #11594 should be modified to something like "Improve performance of the primes module", then this can be an accomplished task in that issue.

StefanKarpinski · 2015-08-10T17:05:46Z

Honestly, I've been hesitating because I'm not really qualified to vet this code. But it's faster and seems to still produce correct results, so it's all good.

Ismael-VC · 2015-08-10T20:41:40Z

@pabloferz great, this is very educational! ✨

tkelman · 2015-08-10T23:46:36Z

Neither am I, but it was a well-done PR and I'm inclined to trust @pabloferz here.

I'm not sure this needs to be in base forever, but that can be revisited down the road as part of #5155 and similar.

pabloferz · 2015-08-11T20:28:23Z

I can try to explain the algorithm and prove its correctness somewhere if you feel that would help.

StefanKarpinski · 2015-08-11T20:29:58Z

That would be great! It doesn't need to be a formal proof, just a convincing explanation (which is really what a proof is, but you know what I mean).

pabloferz changed the title ~~Improve prime sieve performance~~ WIP: Improve prime sieve performance Jul 6, 2015

yuyichao added the performance Must go faster label Jul 6, 2015

pabloferz mentioned this pull request Jul 16, 2015

improve prime sieve performance #11594

Closed

JeffBezanson reviewed Jul 27, 2015
View reviewed changes

pabloferz force-pushed the pz/primes branch from d16cd04 to eb9d8dc Compare July 28, 2015 06:37

pabloferz force-pushed the pz/primes branch 3 times, most recently from 4e76ff6 to c82aff1 Compare August 1, 2015 06:37

tkelman mentioned this pull request Aug 2, 2015

Doc hard, and with a vengeance #11943

Merged

pabloferz force-pushed the pz/primes branch 2 times, most recently from 2a10792 to 844e32f Compare August 3, 2015 07:07

pabloferz mentioned this pull request Aug 5, 2015

WIP: Improve primes performance. #11927

Closed

pabloferz changed the title ~~WIP: Improve prime sieve performance~~ Improve prime sieve performance Aug 6, 2015

Use a wheel sieve to improve performance and export primesmask.

f50da30

pabloferz force-pushed the pz/primes branch from f6f7c12 to f50da30 Compare August 7, 2015 19:26

tkelman added a commit that referenced this pull request Aug 8, 2015

Merge pull request #12025 from pabloferz/pz/primes

ccdfdff

Improve prime sieve performance

tkelman merged commit ccdfdff into JuliaLang:master Aug 8, 2015

pabloferz deleted the pz/primes branch August 10, 2015 16:07

Improve prime sieve performance #12025

Improve prime sieve performance #12025

Conversation

pabloferz commented Jul 6, 2015

pabloferz commented Jul 6, 2015

aiorla commented Jul 6, 2015

Ismael-VC commented Jul 6, 2015

pabloferz commented Jul 6, 2015

ironman353 commented Jul 7, 2015

pabloferz commented Jul 7, 2015

ironman353 commented Jul 7, 2015

mikewl commented Jul 7, 2015

pabloferz commented Jul 11, 2015

ironman353 commented Jul 13, 2015

pabloferz commented Jul 27, 2015

KristofferC commented Jul 27, 2015

JeffBezanson Jul 27, 2015

Choose a reason for hiding this comment

pabloferz Jul 27, 2015

Choose a reason for hiding this comment

pabloferz commented Jul 29, 2015

StefanKarpinski commented Jul 29, 2015

pabloferz commented Jul 29, 2015

StefanKarpinski commented Jul 29, 2015

KristofferC commented Jul 29, 2015

mbauman commented Jul 29, 2015

pabloferz commented Jul 29, 2015

tkelman commented Jul 31, 2015

pabloferz commented Aug 1, 2015

pabloferz commented Aug 2, 2015

pabloferz commented Aug 3, 2015

tkelman commented Aug 3, 2015

pabloferz commented Aug 4, 2015

tkelman commented Aug 7, 2015

pabloferz commented Aug 7, 2015

tkelman commented Aug 7, 2015

pabloferz commented Aug 8, 2015

tkelman commented Aug 8, 2015

pabloferz commented Aug 10, 2015

StefanKarpinski commented Aug 10, 2015

Ismael-VC commented Aug 10, 2015

tkelman commented Aug 10, 2015

pabloferz commented Aug 11, 2015

StefanKarpinski commented Aug 11, 2015