
Improve prime sieve performance #12025

Merged
merged 1 commit on Aug 8, 2015

Conversation

pabloferz
Contributor

This PR changes the algorithm used in primesmask from a Sieve of Atkin to a Sieve of Eratosthenes, based on the redesign proposed by @VicDrastik in #11594.

The main ideas used are:

  • Most integers are composite, so it is better to set all flags to false initially and then invert those for possible primes.
  • Except for 2 and 3, all primes are of the form 6k ± 1.
  • Implement a mod-30 wheel sieve. This sieve stores a flag for each number of the form 30k + p, where p is one of [7, 11, 13, 17, 19, 23, 29, 31] (all primes greater than 5 are of this form).
  • Cross off multiples of primes p of the above form, starting from p².
  • Use the wheel sieve _primesmask (not exported) to construct primesmask and to find primes.
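
As an illustration of the first two points, a minimal sketch (my own, not the PR's actual code) sets every flag to false, switches on only 2, 3 and the 6k ± 1 candidates, and then crosses composites off starting at p²; the PR itself compresses storage further with the mod-30 wheel:

```julia
# Hypothetical sketch of the strategy above (name and layout are mine):
# flags start false, only 6k ± 1 candidates are switched on, and each
# prime p crosses off its odd multiples starting at p^2.
function sketch_primesmask(n::Int)
    sieve = falses(n)                     # most integers are composite
    n >= 2 && (sieve[2] = true)
    n >= 3 && (sieve[3] = true)
    for i in 5:6:n; sieve[i] = true; end  # candidates of the form 6k - 1
    for i in 7:6:n; sieve[i] = true; end  # candidates of the form 6k + 1
    p = 5
    while p * p <= n
        if sieve[p]
            for j in p*p:2*p:n            # step 2p: even multiples were never set
                sieve[j] = false
            end
        end
        p += 2
    end
    return sieve
end
```

findall(sketch_primesmask(n)) then recovers the primes up to n.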

Additionally, the function primesmask is now exported, and both it and primes are extended to generate all the primes between two integers.

As a follow-up, it would be worth trying to write a general segmented version (perhaps a mix between this and the ideas of @ironman353 in #11594, or the one at http://primesieve.org/segmented_sieve.html).

I apologize to @Ismael-VC who was also working on this in #11927 (his approach was to fix the current implementation of the Sieve of Atkin, but this should be even faster as suggested in #11594).

Finally, this PR addresses the initial issue in #11594, but that turned out to be a discussion on how to improve the whole support for primes. Maybe it should be split into smaller issues grouped under a master issue.

CC @danaj @aiorla @StefanKarpinski

@pabloferz pabloferz changed the title from "Improve prime sieve performance" to "WIP: Improve prime sieve performance" on Jul 6, 2015
@pabloferz
Contributor Author

NOTE: This still needs some tests, news and documentation.

@aiorla
Contributor

aiorla commented Jul 6, 2015

Nice addition of the mask with low and high limits. I'm curious about the time/memory performance, to compare it with the Atkin versions or the Eratosthenes one that @ironman353 posted in #11594.

@yuyichao yuyichao added the "performance: Must go faster" label on Jul 6, 2015
@Ismael-VC
Contributor

This is great! @pabloferz no apologies needed, I'm still learning and this is what I wanted to see in order to learn more, thanks!

@pabloferz
Contributor Author

Here are some timings. On my machine (the best out of ten):

# Current julia implementation
julia> @time Base.primesmask(10_000_000);
 129.976 milliseconds (884 k allocations: 15047 KB)

# @VicDrastik's e5 from #11594
julia> @time e5(10_000_000);
  94.491 milliseconds (10378 k allocations: 160 MB, 6.37% gc time)

# @ironman353's version with counting removed to make a fair comparison
julia> @time Eratosthenes_1(10_000_000);
  88.570 milliseconds (9 allocations: 1221 KB)

# @Ismael-VC's proposal
julia> @time atkin(10_000_000);
  65.851 milliseconds (9 allocations: 1221 KB)

# e5, but using loops
julia> @time e5looped(10_000_000);
  45.597 milliseconds (9 allocations: 1221 KB)

# This PR
julia> @time primesmask(10_000_000);
  37.705 milliseconds (10 allocations: 1221 KB)

julia> @time primesmask(1, 10_000_000);
  36.457 milliseconds (42 allocations: 1226 KB)

julia> gc(); @time primesmask(100_000_000, 110_000_000);
  42.236 milliseconds (62 allocations: 1234 KB)

julia> gc(); @time primesmask(1_000_000_000, 1_010_000_000);
  46.094 milliseconds (990 allocations: 1268 KB)

For large numbers this won't be faster than a well implemented segmented sieve; for instance, @ironman353's full_sieve will be better for larger numbers of primes. But a proper segmented version can be added later.

@ironman353

@pabloferz, for N <= 10^6 you should use Eratosthenes_1 in my code. If N > 10^6, then full_sieve().
Here is optimized code for the sieve of Atkin and the corresponding results:
Pi(10^6) = 78498 [0.076584712 seconds (5848772 bytes allocated)]
Pi(10^7) = 664579 [0.170029603 seconds (15975876 bytes allocated)]
Pi(10^8) = 5761455 [1.197614697 seconds (117223764 bytes allocated, 2.15% gc time)]
Pi(10^9) = 50847534 [14.438797381 seconds (1129725876 bytes allocated, 0.92% gc time)]

function Atkin_3(MAX::Int)
  pi::Int64 = 0
  SQRT_MAX::Int = floor(sqrt(MAX)) + 1
  isPrime::Array{Bool} = falses(MAX)
  index::Int = 0
  k1::Int64 = 0
  k::Int64 = 0
  # toggle flags for n = 4x^2 + y^2
  xUpper::Float64 = sqrt(MAX / 4) + 1
  x::Int64 = 1
  y::Int64 = 0
  while x < xUpper
    index = 0
    k1 = 4 * x * x
    y = 1
    if x % 3 == 0
      while true
        k = k1 + y * y
        if k >= MAX
          break
        end
        isPrime[k] = !isPrime[k]
        index += 1
        if index % 2 == 1
          y += 4
        else
          y += 2
        end
      end
    else
      while true
        k = k1 + y * y
        if k >= MAX
          break
        end
        isPrime[k] = !isPrime[k]
        y += 2
      end
    end
    x += 1
  end
  # toggle flags for n = 3x^2 + y^2
  xUpper = sqrt(MAX / 3) + 1
  x = 1
  y = 0
  while x < xUpper
    index = 1
    k1 = 3 * x * x
    y = 2
    while true
      k = k1 + y * y
      if k >= MAX
        break
      end
      isPrime[k] = !isPrime[k]
      index += 1
      if index % 2 == 1
        y += 4
      else
        y += 2
      end
    end
    x += 2
  end
  # toggle flags for n = 3x^2 - y^2 (x > y)
  xUpper = sqrt(MAX) + 1
  x = 1
  y = 0
  while x < xUpper
    k1 = 3 * x * x
    if x % 2 == 0
      y = 1
      index = 0
    else
      y = 2
      index = 1
    end
    while y < x
      k = k1 - y * y
      if k < MAX
        isPrime[k] = !isPrime[k]
      end
      index += 1
      if index % 2 == 1
        y += 4
      else
        y += 2
      end
    end
    x += 1
  end
  # 2 and 3 are prime but not produced by the quadratics
  isPrime[2] = true
  isPrime[3] = true
  # clear multiples of prime squares (non-square-free numbers)
  n2::Int = 0
  for n::Int = 5 : SQRT_MAX
    if isPrime[n]
      n2 = n * n
      for j::Int = n2 : n2 : MAX - 1
        isPrime[j] = false
      end
    end
  end
  # count the surviving primes
  for i::Int = 2 : MAX - 1
    if isPrime[i]
      pi = pi + 1
    end
  end
  return pi
end

@time Atkin_3(1000000000);

# include("E:/Julia/Atkin_3.jl")

@pabloferz
Contributor Author

@ironman353 I'm trying to compare non-segmented sieves. Of course a properly segmented version is going to be faster, but I'm not trying to provide that in this PR (at least not yet).

@ironman353

There is an O(n) sieve of Eratosthenes, but it requires extra memory, so reaching 10^9 is difficult. My Java implementation takes just 1.712 seconds to generate all primes below 10^8 (single-threaded), which is way better than the O(n log log n) sieve of Eratosthenes or the various versions of the sieve of Atkin.
Here is my Java code. I could not convert it into Julia; I don't know if it has a data type similar to ArrayList in Java or vector in C++.
Pi(10^7) = 664579 in 0.423 seconds
Pi(10^8) = 5761455 in 1.712 seconds

import java.io.*;
import java.util.*;
import java.math.*;

public class linear_sieve {
    public static int MAX = 100000000;
    public static int SQRT_MAX = (int)Math.floor(Math.sqrt((double)MAX));
    public static int[] x = new int[MAX];
    public static ArrayList<Integer> primes = new ArrayList<Integer>();
    public static void main(String[] args) {
        double start = System.currentTimeMillis();
        long pi = 0;
        for (int i = 2; i < MAX; i++) {
            if (x[i] == 0) {
                x[i] = i;
                primes.add(i);
            }
            for (int j = 0; j < primes.size() && primes.get(j) <= x[i] && i * primes.get(j) < MAX; j++) {
                x[i * primes.get(j)] =  primes.get(j);
            }
        }
        System.out.println("Pi(" + MAX + ") = " + primes.size());
        double end = System.currentTimeMillis();
        System.out.println("Time elapsed : " + (end - start) / 1000d + " seconds");
    }
}

@mikewl
Contributor

mikewl commented Jul 7, 2015

@ironman353 Julia's Array type is the equivalent of vector and ArrayList: it grows dynamically via push!.
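
Following that note, the Java linear sieve above translates almost line for line; this is a rough sketch of mine (the name linear_sieve is made up), using a plain Vector{Int} with push! where the Java uses ArrayList:

```julia
# Linear (O(n)) sieve: spf[i] holds the smallest prime factor of i, and each
# composite i * p is crossed off exactly once, by its smallest prime factor p.
function linear_sieve(MAX::Int)
    spf = zeros(Int, MAX)
    primes = Int[]
    for i in 2:MAX-1
        if spf[i] == 0          # i has no smaller factor, so it is prime
            spf[i] = i
            push!(primes, i)
        end
        for p in primes
            (p > spf[i] || i * p >= MAX) && break
            spf[i * p] = p      # i * p is composite with smallest factor p
        end
    end
    return primes
end
```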

@pabloferz
Contributor Author

@ironman353 That optimized sieve of Atkin is really nice and is faster than what I had. Did you come up with it, or did you get the algorithm from somewhere else? If it's yours, would you mind if I rewrote the PR based on your code? If it's not, do you know if we can release it under the MIT license?

@ironman353

@pabloferz, I found the algorithm at http://compoasso.free.fr/primelistweb/page/prime/atkin_en.php
I don't know if you can release it under the MIT license. You can contact them via email (at the bottom of the page http://compoasso.free.fr/primelistweb/page/prime/accueil_en.php).
There are better implementations available on that page. As I am new to Julia, I was unable to implement them in Julia.

@pabloferz
Contributor Author

How can I find out which conflicts this PR has? Would a rebase fix them?

@StefanKarpinski

@KristofferC
Member

Yes, a rebase on master should fix that.

@@ -407,6 +407,7 @@ export
prevpow2,
prevprod,
primes,
primesmask,
Member


I'm not sure we want to add this function, but if we do it needs documentation.

Contributor Author


I was just following the comments in this discussion #11594 (comment)

(Also, I'm aware this would need documentation, I just haven't added it yet)

@pabloferz
Contributor Author

I have a segmented version of the primes function, but for it to perform best it would have to be cache-size dependent. However, I do not know how to get the cache size from within Julia, or whether such an implementation is desirable. Besides, that can be done in another PR.

If the proposed changes seem good, I can add the corresponding documentation so this can be merged.
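
For what it's worth, the shape of such a segmented sieve might look like the sketch below (entirely my own illustration, with SEGMENT_LEN standing in for the cache-size query that Julia doesn't expose):

```julia
# Rough sketch of the segmented idea (assumed, not the actual implementation):
# sieve [2, limit] in fixed-size blocks so the working set stays cache-sized.
const SEGMENT_LEN = 1 << 15          # 32768 flags per block, ideally tuned to L1

function segmented_count(limit::Int)
    limit < 2 && return 0
    # base primes up to sqrt(limit) from a small full sieve
    r = isqrt(limit)
    base = trues(max(r, 1))
    base[1] = false
    for p in 2:isqrt(r)
        base[p] && (base[p*p:p:r] .= false)
    end
    small = findall(base)
    total = 0
    lo = 2
    while lo <= limit
        hi = min(lo + SEGMENT_LEN - 1, limit)
        seg = trues(hi - lo + 1)
        for p in small
            start = max(p * p, cld(lo, p) * p)   # first multiple of p in [lo, hi]
            start > hi && continue
            seg[start-lo+1 : p : hi-lo+1] .= false
        end
        total += count(seg)
        lo = hi + 1
    end
    return total
end
```

Each block only ever touches SEGMENT_LEN flags plus the (much smaller) list of base primes, which is what makes the cache-size choice matter.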

@StefanKarpinski
Member

It would be quite nice to provide an API for finding out things like cache sizes.

@pabloferz
Contributor Author

👍 to such an API

On the subject of using a BitArray for the primesmask function, here are some timings to serve as reference:

# Current julia implementation

julia> @time Base.primesmask(100_000_000);
   3.046 seconds      (9042 k allocations: 150 MB, 0.25% gc time)

julia> gc(); @time Base.primesmask(1_000_000_000);
  39.573 seconds      (90943 k allocations: 1507 MB, 0.24% gc time)

julia> gc(); @time Base.primes(100_000_000);
   3.348 seconds      (9042 k allocations: 194 MB, 0.26% gc time)

julia> gc(); @time Base.primes(1_000_000_000);
  41.220 seconds      (90943 k allocations: 1895 MB, 0.23% gc time)

# This PR

julia> gc(); @time primesmask(100_000_000);
 327.903 milliseconds (8 allocations: 121 MB, 0.26% gc time)

julia> gc(); @time primesmask(1_000_000_000);
   3.805 seconds      (8 allocations: 1208 MB, 0.04% gc time)

julia> gc(); @time primes(100_000_000);
 405.530 milliseconds (9 allocations: 91578 KB, 0.13% gc time)

julia> gc(); @time primes(1_000_000_000);
   4.520 seconds      (9 allocations: 701 MB, 0.03% gc time)

# Using a BitArray

julia> gc(); @time primesmask(100_000_000);
 454.866 milliseconds (11 allocations: 38249 KB, 0.11% gc time)

julia> gc(); @time primesmask(1_000_000_000);
   5.101 seconds      (11 allocations: 374 MB, 0.02% gc time)

@StefanKarpinski
Member

I was looking at this for a good 30 seconds trying to figure out what the %#!@ was going on before I noticed the different units. I really hate that we don't print times in a standard unit anymore.

@KristofferC
Member

Amen to that. I keep getting annoyed having to mentally parse the units. Before you just compared two numbers.

@mbauman
Member

mbauman commented Jul 29, 2015

There are still some known performance snags for BitArrays involving GC frames, pointer lookup hoists, and user-defined @inbounds (#11403, #11430, #7799). I'm convinced we can eventually get them more competitive in more situations.

In this case, since BitArrays are within 15%, I'd try rewriting your @inbounds indexing with unsafe_getindex and unsafe_setindex!. That might be enough to push BitArrays over the top. That said, a 10x gain is very impressive! Don't let me hold you up!

@pabloferz
Contributor Author

With unsafe_setindex! (the implementation does not use getindex), the timings are these:

julia> gc(); @time bitprimesmask(100_000_000);
 389.984 milliseconds (11 allocations: 38249 KB, 0.06% gc time)

julia> gc(); @time bitprimesmask(1_000_000_000);
   4.412 seconds      (11 allocations: 374 MB, 0.02% gc time)

This seems like a good compromise between speed and space. If there are plans to improve BitArray performance, that is another reason to use them for the primesmask function.
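
As a side-by-side of that speed/space compromise, here is a small sketch of mine (not the PR code): the same sieve body run over a Vector{Bool}, one byte per flag, and a BitVector, one bit per flag:

```julia
# Same simple sieve over two storage types; flags come in prefilled with true.
function sieve!(flags)
    n = length(flags)
    n >= 1 && (flags[1] = false)
    for p in 2:isqrt(n)
        if flags[p]
            @inbounds for j in p*p:p:n
                flags[j] = false          # cross off multiples of p
            end
        end
    end
    return flags
end

bool_mask = sieve!(fill(true, 1_000_000))   # Vector{Bool}: ~1 MB
bit_mask  = sieve!(trues(1_000_000))        # BitVector:   ~122 KB
```

With @inbounds (or unsafe_setindex!, as suggested above) the BitVector version narrows the speed gap while using an eighth of the memory.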

@tkelman
Contributor

tkelman commented Jul 31, 2015

Best to squash out any intermediate commits that would have failed tests.

@pabloferz pabloferz force-pushed the pz/primes branch 3 times, most recently from 4e76ff6 to c82aff1 Compare August 1, 2015 06:37
@pabloferz
Contributor Author

I believe this PR is ready, so I leave it for you to review it.

Just as a remark, the use of a BitArray in the primesmask function does not affect the performance of primes.

@pabloferz
Contributor Author

Also, this would close #11594, but a lot of the discussion there would have to be moved into at least another issue.

@pabloferz pabloferz force-pushed the pz/primes branch 2 times, most recently from 2a10792 to 844e32f Compare August 3, 2015 07:07
@pabloferz
Contributor Author

If I got #11943 right, this is now using the new doc system and, hopefully, ready.

@tkelman
Contributor

tkelman commented Aug 3, 2015

I think so. You can run ./julia doc/genstdlib.jl and see if the rst docs come out the way you'd like.

@pabloferz
Contributor Author

This PR, in my opinion, is ready to be merged, unless we need to wait for #12435 to land first.

@tkelman @StefanKarpinski

@pabloferz pabloferz changed the title from "WIP: Improve prime sieve performance" to "Improve prime sieve performance" on Aug 6, 2015
@tkelman
Contributor

tkelman commented Aug 7, 2015

Is primesmask ending up in the generated RST docs? If not, I think you need to add a signature for it so doc/genstdlib.jl can fill in the docstring.

@pabloferz
Contributor Author

@tkelman No, it doesn't show up in the RST docs. Where do I put the signature?

@tkelman
Contributor

tkelman commented Aug 7, 2015

Where in the manual do you want this function to be documented?

@pabloferz
Contributor Author

Ok, I tested it and it now generates proper RST docs.

@tkelman
Contributor

tkelman commented Aug 8, 2015

@StefanKarpinski I'm going to assume you're cool with merging this?

tkelman added a commit that referenced this pull request Aug 8, 2015
Improve prime sieve performance
@tkelman tkelman merged commit ccdfdff into JuliaLang:master Aug 8, 2015
@pabloferz pabloferz deleted the pz/primes branch August 10, 2015 16:07
@pabloferz
Contributor Author

#11927 should be closed, and the title of #11594 should be changed to something like "Improve performance of the primes module"; then this can be marked as an accomplished task in that issue.

@StefanKarpinski
Member

Honestly, I've been hesitating because I'm not really qualified to vet this code. But it's faster and seems to still produce correct results, so it's all good.

@Ismael-VC
Contributor

@pabloferz great, this is very educational! ✨

@tkelman
Contributor

tkelman commented Aug 10, 2015

Neither am I, but it was a well-done PR and I'm inclined to trust @pabloferz here.

I'm not sure this needs to be in base forever, but that can be revisited down the road as part of #5155 and similar.

@pabloferz
Contributor Author

I can try to explain the algorithm and prove its correctness somewhere if you feel that would help.

@StefanKarpinski
Member

That would be great! It doesn't need to be a formal proof, just a convincing explanation (which is really what a proof is, but you know what I mean).
